DOCUMENT RESUME

ED 137 361                                                   TM 006 165

AUTHOR        Hambleton, Ronald K.; And Others
TITLE         Developments in Latent Trait Theory: A Review of
              Models, Technical Issues, and Applications.
PUB DATE      [Apr 77]
NOTE          126p.; Paper presented at a joint meeting of the
              National Council on Measurement in Education and the
              American Educational Research Association (New York,
              New York, April 1977)

EDRS PRICE    MF-$0.83 HC-$7.35 Plus Postage.
DESCRIPTORS   Ability; Bayesian Statistics; *Cognitive Measurement;
              Computer Programs; Criterion Referenced Tests;
              Goodness of Fit; *Mathematical Models; *Measurement;
              Probability; Response Style (Tests); Test Bias; Test
              Construction; *Testing; Testing Problems; Test
              Interpretation; Test Items
IDENTIFIERS   Item Characteristic Curve Theory; *Latent Trait
              Theory; Maximum Likelihood Estimation; Tailored
              Testing

ABSTRACT
              Latent trait theory supposes that, in testing
situations, examinee performance on a test can be predicted (or
explained) by defining examinee characteristics, referred to as
traits, estimating scores for examinees on these traits, and using the
scores to predict or explain test performance (Lord and Novick,
1968). In view of the breakthroughs in several testing problem areas
brought about by the use of latent trait theory, it is clear that the
field of latent trait theory will become increasingly more important
to measurement specialists and test practitioners. This paper
comprehensively reviews this field and addresses four matters. First,
the nature and characteristics of latent trait theory are introduced.
Second, a review of many of the technical developments in the field
is provided. Third, several promising applications of latent trait
models are described. Finally, some additional areas for research and
development are suggested. (HC)



***********************************************************************
* Documents acquired by ERIC include many informal unpublished        *
* materials not available from other sources. ERIC makes every effort *
* to obtain the best copy available. Nevertheless, items of marginal  *
* reproducibility are often encountered and this affects the quality  *
* of the microfiche and hardcopy reproductions ERIC makes available   *
* via the ERIC Document Reproduction Service (EDRS). EDRS is not      *
* responsible for the quality of the original document. Reproductions *
* supplied by EDRS are the best that can be made from the original.   *



3/29/77 



Developments in Latent Trait Theory: A Review of
Models, Technical Issues, and Applications¹ ²

Ronald K. Hambleton, Hariharan Swaminathan, Linda L. Cook,
Daniel Eignor, Janice A. Gifford

University of Massachusetts, Amherst

There are many well-documented shortcomings of standard testing and
measurement technology.³ For one, the values of standard item parameters
(item difficulty and item discrimination) are not invariant across
groups of examinees that differ in ability. This means that standard
item statistics are only useful in test construction for examinee
populations very similar to the sample of examinees in which the item
statistics were obtained. There are many testing situations where invariant
item parameters would be highly desirable. Another shortcoming of
standard testing technology is that comparisons of examinees on an ability
measured by a set of test items comprising a test are limited to
situations where examinees are administered the same (or parallel) test items.
While most common standardized achievement and aptitude tests are
typically suitable for middle-ability students, these tests do not provide
very precise estimates of ability for either high- or low-ability examinees.
"Tailored testing" is designed to correct this shortcoming by administering
test items to examinees that are carefully selected to "match" their
ability levels (Lord, 1970b, 1974b; Weiss, 1976; Wood, 1973). In
"tailored testing," it is likely that no two examinees will take the same

¹ A paper presented at a joint meeting of NCME and AERA in New York,
April 1977.

² Laboratory of Psychometric and Evaluative Research Report No. 47.
Amherst, Massachusetts: School of Education, University of Massachusetts,
1977.

³ "Standard testing and measurement technology" refers to commonly
used methods and techniques for test design and analysis.






set of test items (or even the same number of test items). Since some
examinees will be administered more difficult sets of test items than
other examinees, the usual examinee test scores (or proportion-correct
scores) do not provide an adequate basis for ranking examinees on the
ability measured by the test items in the "domain of test items" from
which test items were drawn. How then can examinees be compared?
Certainly standard test models (Lord and Novick, 1968) cannot handle the
problem.

Another shortcoming of standard testing technology is that it
provides no basis for determining what a particular examinee might do
when confronted with a test item. Such information is necessary, for
example, if a test designer desires to predict test score
characteristics in one or more populations of examinees or to design tests with
particular characteristics for certain populations of examinees.

Besides the three shortcomings of standard testing technology
mentioned above, standard testing technology has failed to provide
satisfactory solutions to many testing problems (for example, test
design, test score equating, and item bias). For these and other reasons,
many psychometricians have been investigating and developing more
appropriate theories of mental measurements. Consequently, considerable
attention is currently being directed toward the field of latent trait
theory, sometimes referred to as item response theory or item
characteristic curve theory. Latent trait theory can be traced back
to the work of Lawley (1943, 1944). Lazarsfeld (1950) was perhaps the
first to introduce the term "latent traits." The




work of Lord (1952, 1953a, 1953b), however, is generally regarded as the
"birth" of latent trait theory (or modern test theory, as it is sometimes
called). Progress in the 1950's and 60's was painstakingly slow, in part
due to the mathematical complexity of the field, the lack of convenient
and efficient computer programs to analyze the data according to latent
trait theory, and the general skepticism about the gains that might accrue
from this particular line of research. However, important recent
breakthroughs in problem areas such as test score equating (Lord, 1975a;
Rentz and Bashaw, 1975), tailored testing (Lord, 1974b; Weiss, 1976), and
test design and test evaluation (Wright, 1968) through applications of
latent trait theory have attracted considerable interest from measurement
specialists. Other factors that have contributed to the current interest
in latent trait theory include the availability of a number of useful
computer programs, publication of a variety of successful applications
in measurement journals (Bock, 1972; Lord, 1968, 1974b, 1975d; Samejima,
1969, 1972; Whitely & Dawis, 1974; Wright & Panchapakesan, 1969), and
the strong endorsement of the field by authors of the last three reviews
of test theory in the Annual Review of Psychology (Keats, 1967; Bock &
Wood, 1971; Lumsden, 1976). Another important stimulant of interest
in the field was the publication of Lord and Novick's Statistical Theories
of Mental Test Scores. They devoted five chapters (four of them written
by Allan Birnbaum) to the topic of latent trait theory. A testimony to
the current interest and popularity of the topic is the fact that the
Journal of Educational Measurement will publish six invited papers on latent
trait theory and applications in the summer issue of 1977.

What is latent trait theory? A theory of latent traits supposes
that, in testing situations, examinee performance on a test can be
predicted (or explained) by defining examinee characteristics, referred
to as traits, estimating scores for examinees on these traits, and using
the scores to predict or explain test performance (Lord and Novick, 1968).
Since the traits are not directly measurable, they are referred to as
latent traits or abilities. A latent trait model specifies a relationship
between the observable examinee test performance and the unobservable
traits or abilities assumed to underlie performance on the test. The
relationship between the "observable" and the "unobservable" quantities
is described by a mathematical function. For this reason, latent trait
models are mathematical models. These mathematical models are based on
specific assumptions about the test data. When selecting a particular
latent trait model to apply to one's test data, it is necessary to
consider whether the data satisfy the assumptions of the model. If they do
not, different test models should be considered. Alternatively, some
psychometricians (for example, Wright, 1968) have recommended that test developers
design their tests so as to satisfy the assumptions of the particular
latent trait model they are interested in using. In this way, the advantages
of the particular latent trait model of interest can be utilized.

In view of the breakthroughs in several testing problem areas
brought about by the use of latent trait theory, it is clear that the
field of latent trait theory will become increasingly more important to
measurement specialists and test practitioners. Therefore, given the
newness of the field, its rapid growth in recent years, and the diversity



of views and contributions, it seems apparent that a comprehensive review
of the field is in order.

This document addresses four matters. First, the nature and
characteristics of latent trait theory are introduced. Second, a review
of many of the technical developments in the field is provided. Third,
several promising applications of latent trait models are described.
Finally, some additional areas for research and development are suggested.




Latent Trait Theory 
Dimensionality of the latent space, local independence, and
item characteristic curves are three important notions that arise in
connection with latent trait theory. These three notions, along with
the ability scale, are discussed next.

Dimensionality of the Latent Space

In a general theory of latent traits, it is assumed that a set of
k latent traits or abilities underlies examinee performance on a set of
test items. The k latent traits can be used to define a k-dimensional
latent space, with each examinee's location in the latent space determined
by the examinee's position on each latent trait. The number of dimensions
of the latent space depends on the number of abilities
measured by the test in the population of examinees to which the test is
administered. The latent space is referred to as complete if all latent
traits influencing the test scores of a population of examinees have been
specified.

It is commonly assumed that only one ability is necessary to "explain,"
or "account" for, examinee test performance. Latent trait models that
assume a single latent ability is sufficient to explain or account for
examinee performance are referred to as unidimensional. Those models
that assume that more than a single ability is necessary to adequately
account for examinee test performance are referred to as multidimensional.
The reader is referred to the work of Mulaik (1972) and Samejima (1974)
for discussions of multidimensional latent trait models.






The assumption of a unidimensional latent space is a common one for
test constructors, since they usually desire to construct unidimensional
tests so as to enhance the interpretability of a set of test scores
(Lumsden, [year illegible]). What does it mean to say that a test is
unidimensional?

Suppose a test consisting of n items is intended for use in r
subpopulations of examinees (e.g., several ethnic groups). Consider next the
conditional distributions of test scores at a particular ability level
for the r subpopulations. These conditional distributions for the r
subpopulations will be identical if the test is unidimensional. If the
conditional distributions vary across the r subpopulations, it
can only be because the test is measuring something other than the
single ability. Hence, the test cannot be unidimensional.

It is possible for a test to be unidimensional within one
population of examinees and not unidimensional in another. Consider a test
with a heavy cultural loading. This test could appear to be unidimensional
for all populations with the same cultural background. However, when
administered to populations with varied cultural backgrounds, it may in
fact have more than a single dimension underlying the test score.
Examples of this situation are seen when the factor structure of a
particular set of test items varies from one cultural group to another.

Lumsden (1961) provided an excellent review of methods for
constructing unidimensional tests. He concluded that the method of factor
analysis held the most promise. Fifteen years later he reaffirmed his
conviction (Lumsden, 1976). Essentially, Lumsden recommends that a






test constructor generate an initial pool of test items selected on the
basis of empirical evidence and a priori grounds. Such an item selection
procedure will increase the likelihood that a unidimensional set of
test items within the pool of items can be found. If test items are not
preselected, the pool may be too heterogeneous for the unidimensional
set of items in the item pool to emerge. In Lumsden's method, a factor
analysis is performed and items not measuring the dominant factor
obtained in the factor solution are removed. The remaining items are
factor analyzed, and again, "deviant" items are removed. The process is
repeated until a satisfactory solution is obtained. Convergence is most
likely when the initial item pool is carefully selected to include only
items that appear to be measuring a common trait. Lumsden proposed that
the ratio of first factor variance to second factor variance be used
as an "index of unidimensionality."
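
As an illustration only, the ratio index can be sketched numerically. The
sketch below assumes that the two largest eigenvalues of the inter-item
correlation matrix stand in for the first and second factor variances,
which is a simplification of a full factor analysis. The function names
and the example matrix are hypothetical and are not taken from Lumsden's
work.

```python
import math


def dominant_eigenpair(matrix, iterations=500):
    """Power iteration: largest eigenvalue (and vector) of a symmetric PSD matrix."""
    n = len(matrix)
    vec = [float(i + 1) for i in range(n)]  # arbitrary non-constant start vector
    norm = math.sqrt(sum(v * v for v in vec))
    vec = [v / norm for v in vec]
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in product))
        if norm == 0.0:  # start vector lies entirely in the null space
            break
        vec = [v / norm for v in product]
    product = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(vec[i] * product[i] for i in range(n))  # Rayleigh quotient
    return eigenvalue, vec


def unidimensionality_index(corr):
    """Ratio of the largest to the second-largest eigenvalue of `corr`."""
    n = len(corr)
    first, vec = dominant_eigenpair(corr)
    # Deflate: remove the first component, then find the next largest eigenvalue.
    deflated = [[corr[i][j] - first * vec[i] * vec[j] for j in range(n)]
                for i in range(n)]
    second, _ = dominant_eigenpair(deflated)
    return first / second


# Hypothetical correlation matrix: four items, every pair correlated 0.5.
example = [[1.0 if i == j else 0.5 for j in range(4)] for i in range(4)]
index = unidimensionality_index(example)
print(round(index, 4))
```

A larger ratio suggests a more dominant first factor. No threshold for a
"satisfactory" value is implied by this sketch.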

Factor analysis can also be used to check the reasonableness of
the assumption of unidimensionality with a set of test items (Hambleton
& Traub, 1973). However, the approach is not without problems. For
example, much has been written about the merits of using tetrachoric
correlations or phi correlations (McDonald & Ahlawat, 1974). The common
belief is that using phi correlations will lead to a factor solution
with too many factors, some of them "difficulty factors" found because
of the range of item difficulties among the items in the pool. McDonald
and Ahlawat (1974) concluded that "difficulty factors" are unlikely if
the range of item difficulties is not extreme and the items are not too
highly discriminating.
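
One commonly cited reason why phi correlations are thought to produce
"difficulty factors" is that the largest phi attainable for a pair of
items depends on how far apart their difficulties are. The following
sketch, with hypothetical function names and made-up counts, computes phi
from a 2 x 2 table of counts and the upper bound on phi for two given
proportions correct.

```python
import math


def phi_coefficient(n11, n10, n01, n00):
    """Phi from counts; n10 means item 1 correct and item 2 incorrect."""
    n = n11 + n10 + n01 + n00
    p1 = (n11 + n10) / n   # proportion correct on item 1
    p2 = (n11 + n01) / n   # proportion correct on item 2
    p11 = n11 / n          # proportion correct on both items
    return (p11 - p1 * p2) / math.sqrt(p1 * (1 - p1) * p2 * (1 - p2))


def max_phi(p1, p2):
    """Largest phi attainable for two items with proportions correct p1, p2."""
    lo, hi = min(p1, p2), max(p1, p2)
    return math.sqrt(lo * (1 - hi) / (hi * (1 - lo)))


# Made-up data: a hard item (20% correct) and an easy item (80% correct).
# Everyone who passes the hard item also passes the easy one, so the items
# are as strongly related as these margins allow.
observed = phi_coefficient(n11=20, n10=0, n01=60, n00=20)
bound = max_phi(0.2, 0.8)
print(round(observed, 4), round(bound, 4))
```

In this made-up case the observed phi equals its upper bound, which is well
below 1 even though the items are as consistent as their difficulties
permit. Items of equal difficulty have an upper bound of 1.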




Tetrachoric correlations have one attractive feature:
a sufficient condition for the unidimensionality of a
set of items is that the matrix of tetrachoric item intercorrelations
has only one common factor (Lord & Novick, 1968). On the negative
side, the condition is not necessary. Tetrachoric correlations are
awkward to calculate (the formula is complex and requires some numerical
integration) and, in addition, do not necessarily yield a correlation
matrix that is positive definite, a problem when factor analysis is
attempted.

Local Independence 

The assumption of local independence states that the probability of
an examinee answering a test item correctly is not affected by his or her
performance on any other item in the test.

If we let $U_g$, $g = 1, 2, \ldots, n$, represent the binary responses
(1, if correct; 0, if incorrect) of an examinee to a set of n test items,
$P_g$ = the probability of a correct answer by the examinee to item g, and
$Q_g = 1 - P_g$, then the assumption of local independence leads to the
following statement:

$$\mathrm{Prob}[U_1 = u_1, U_2 = u_2, \ldots, U_n = u_n \mid \theta] = \prod_{g=1}^{n} P_g^{u_g} Q_g^{1-u_g} \qquad [1]$$



That is, the probability of an examinee response pattern is given by the
product of probabilities of the item responses.

One result of the assumption of local independence is that the
frequency of test scores across examinees for fixed ability, denoted
$\theta$, is given by

$$f(x \mid \theta) = \sum_{\sum u_g = x} \; \prod_{g=1}^{n} P_g^{u_g} Q_g^{1-u_g} \qquad [2]$$

where x is an examinee's test score, which can take on values from 0 to n.
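
A minimal sketch of Equation [2] follows. It enumerates every response
pattern directly, which is only practical for short tests; the function
name and the item probabilities are illustrative assumptions, not values
from the paper.

```python
from itertools import product


def score_distribution(probs):
    """Distribution of number-correct score x at one fixed ability level.

    `probs` holds P_g, the probability of a correct answer to each item g at
    that ability level. Local independence is assumed, so the probability of
    a response pattern is the product of the item-level probabilities.
    """
    n = len(probs)
    dist = [0.0] * (n + 1)
    for pattern in product((0, 1), repeat=n):      # all 2**n response patterns
        likelihood = 1.0
        for u, p in zip(pattern, probs):
            likelihood *= p if u == 1 else 1.0 - p
        dist[sum(pattern)] += likelihood           # accumulate by test score x
    return dist


# Hypothetical three-item test at a single ability level.
dist = score_distribution([0.9, 0.6, 0.3])
print([round(d, 3) for d in dist])
```

The entries sum to one, and entry x gives the probability of obtaining
test score x at that ability level.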

The assumption of local independence for the case when $\theta$ is
unidimensional and the assumption of a unidimensional latent space are
equivalent. First, suppose a set of test items measure a common ability.
Then, for examinees at a fixed ability level $\theta$, item responses are
statistically independent. For fixed ability level $\theta$, if items
were not statistically independent, it would imply that
some examinees have higher expected test scores than other examinees of the
same ability level. Consequently, more than one ability would be necessary
to account for examinee test performance. This is a clear violation of
the original assumption that the items were unidimensional. Second, the
assumption of local independence implies that item responses are
statistically independent for examinees at a fixed ability level. Therefore,
only one ability is necessary to account for the relationship among a
set of test items.

It is important to note that the assumption of local independence
does not imply that test items are uncorrelated over the total group of
examinees (Lord & Novick, 1968, p. 361). Positive correlations between
pairs of items will result whenever there is variation among the examinees
on the ability measured by the test items.
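
A small numerical sketch may help here. It assumes a population made up of
a few discrete ability levels, with two items that are independent within
each level; the weights and probabilities are invented for illustration.

```python
def marginal_item_covariance(groups):
    """Covariance between two item scores over the whole examinee group.

    `groups` is a list of (weight, p_i, p_j) triples, one per ability level:
    the share of examinees at that level and the probabilities of answering
    items i and j correctly there. Within a level the items are treated as
    independent, so P(both correct) = p_i * p_j.
    """
    p_i = sum(w * pi for w, pi, pj in groups)
    p_j = sum(w * pj for w, pi, pj in groups)
    p_ij = sum(w * pi * pj for w, pi, pj in groups)
    return p_ij - p_i * p_j


# Half the examinees at a low ability level, half at a high one (invented).
mixed = marginal_item_covariance([(0.5, 0.2, 0.3), (0.5, 0.8, 0.9)])
# Every examinee at the same ability level.
single = marginal_item_covariance([(1.0, 0.2, 0.3)])
print(round(mixed, 4), round(single, 4))
```

With no variation in ability the covariance vanishes; with two ability
levels it is positive even though the items are independent at each level.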






Because of the equivalence between the assumptions of local independence
and of the unidimensionality of the latent space, the extent to which a
set of test items satisfies the assumption of local independence can also
be studied using factor analytic techniques. Also, a rough check on the
statistical independence of item responses for examinees at the same
ability level was offered by Lord (1953a). His suggestion was to
consider examinee item responses for examinees within a narrow range of
ability. For each pair of items, a statistic can be calculated to
provide a measure of the independence of item responses. If the
proportion of examinees obtaining each response pattern (00, 01, 10, 11)
can be "predicted" from the marginals for the group of examinees, the item
responses on the two items are statistically independent. The value of
the $\chi^2$ statistic can be computed for each pair of items, summed, and
tested for significance. The process would be repeated for examinees
located in different regions of the ability continuum.
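
The pairwise check can be sketched as follows. This is one plausible
reading of the procedure (a Pearson chi-square on the 2 x 2 table of
response patterns), not necessarily the exact statistic Lord used; the
function name and data are hypothetical, and the examinees passed in are
assumed to have already been restricted to a narrow ability range.

```python
def pair_chi_square(responses_i, responses_j):
    """Pearson chi-square for independence of two binary item scores.

    Both arguments are equal-length lists of 0/1 scores for the same
    examinees. Every row and column total must be nonzero.
    """
    n = len(responses_i)
    observed = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
    for a, b in zip(responses_i, responses_j):
        observed[(a, b)] += 1
    row = {a: observed[(a, 0)] + observed[(a, 1)] for a in (0, 1)}
    col = {b: observed[(0, b)] + observed[(1, b)] for b in (0, 1)}
    chi2 = 0.0
    for (a, b), count in observed.items():
        expected = row[a] * col[b] / n   # count "predicted" from the marginals
        chi2 += (count - expected) ** 2 / expected
    return chi2


# Hypothetical responses of four examinees of similar ability.
independent = pair_chi_square([0, 0, 1, 1], [0, 1, 0, 1])
dependent = pair_chi_square([0, 0, 1, 1], [0, 0, 1, 1])
print(independent, dependent)
```

Each pairwise value has one degree of freedom; the values would then be
summed over item pairs as the text describes.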

Test and Item Characteristic Curves

The frequency distribution of test scores for a fixed
level of $\theta$ can be obtained using Equation [2] defined in the previous
section. The curve connecting the means of these distributions represents
the regression of test scores on ability $\theta$. If the test is
unidimensional, this curve is referred to as a test characteristic curve
(or test characteristic function if the latent space is multidimensional).

It is also possible to develop item characteristic curves in a
similar manner. The frequency distribution of a binary item score for
fixed ability $\theta$ can be written

$$f_g(u_g \mid \theta) = P_g^{u_g} Q_g^{1-u_g}, \qquad [3]$$

i.e.,

$$f_g(u_g \mid \theta) = \begin{cases} P_g & \text{if } u_g = 1, \\ Q_g & \text{if } u_g = 0. \end{cases}$$

The curve connecting the means of the conditional distributions,
represented by Equation [3], is the regression of item score on ability and
is referred to as an item characteristic curve (or item characteristic
function if the latent ability space is multidimensional). An item
characteristic curve is a mathematical function that relates the
probability of success on an item to the ability measured by the item set or
test that contains it. In simple terms, it is the non-linear regression
function of item score on the latent trait measured by the test.

If the complete latent space is defined for the examinee populations
of interest, the conditional distributions of item scores for fixed
ability level must be identical across these populations. If the
conditional distributions are identical, then the curves connecting the means
of these distributions must be identical; i.e., the item
characteristic curve will remain invariant across populations of examinees for
which the complete latent space has been defined. Since the
probability of an individual examinee providing a correct answer to an
item depends only on the form of the item characteristic curve, it is
independent of the distribution of examinee ability in the population of
examinees of interest. Thus, the probability of a correct response to
an item by an examinee will not depend on how many other examinees are






located at the same ability level. In other words, the shape of an item
characteristic curve does not depend on the distribution of ability in
the examinee population. This invariance property of item
characteristic curves, and consequently of the parameters describing the curves,
is one of the attractive characteristics of latent trait models. The
invariance of latent trait item parameters has important implications for
tailored testing, item banking, the study of item bias, and other
applications of latent trait models.

It is common to interpret $P_g(\theta)$ as the probability of an examinee
answering item g correctly. Lord (1974b) questioned this interpretation
and provided an example to show that this common interpretation of
$P_g(\theta)$ leads to an awkward situation. Consider two examinees, a and
b, and two items, i and j. Suppose examinee a knows the answer to item i
and does not know the answer to item j. Consider the situation to be
reversed for examinee b. Then, $P_i(\theta_a) = 1$, $P_j(\theta_a) = 0$,
$P_i(\theta_b) = 0$, $P_j(\theta_b) = 1$.
The first two equations suggest that item i is easier than item j. The
other two equations suggest the reverse conclusion. One interpretation
is that items i and j measure different abilities for the two examinees.
Of course, this would make it impossible to compare the two students.
One reasonable solution to the dilemma is to define the meaning of
$P_g(\theta)$ differently. Lord suggests that $P_g(\theta)$ be interpreted
as the probability of a correct response for the examinee across test items
with nearly identical item parameters.

Each item characteristic curve for a particular latent trait model
is a member of a family of curves of the same general form. The number






of parameters required to describe an item characteristic curve will
depend on the particular latent trait model. It is common, though, for
the number of parameters to be one, two, or three. For example, the
item characteristic curve of the latent linear model (Figure 1, c)
has the general form $P_g(\theta) = b_g + a_g\theta$, where $P_g(\theta)$
designates the probability of a correct response to item g by an examinee
with ability level $\theta$.
The function is described by two item parameters, item difficulty and
item discrimination, denoted $b_g$ and $a_g$, respectively. An item
characteristic curve is defined completely when its general form is
specified and when the parameters of the curve for a particular item are known.
Item characteristic curves of the latent linear model will vary
in their intercepts ($b_g$) and slopes ($a_g$) to reflect the fact that the
test items vary in "difficulty" and "discriminating power."

Item characteristic curves for Guttman's perfect scale model are
shown in Figure 1 (a). These curves take the shape of step functions.
Probabilities of correct responses are either 0 or 1. The critical ability
level $\theta^*$ is the point on the ability scale where probabilities change
from 0 to 1. Different items lead to different values of $\theta^*$. When
$\theta^*$ is high we have a difficult item, and when $\theta^*$ is low, an
easy item. Figure 1 (b) describes a variation on Guttman's "perfect scale"
model. Item characteristic curves take the shape of step functions, but the
probabilities of incorrect and correct responses, in general, differ
from 0 and 1. Figures 1 (d), (e), and (f) show "S"-shaped curves
representing the one-, two-, and three-parameter logistic models,
respectively. With the one-parameter logistic
model, the item characteristic curves are non-intersecting curves that



Figure 1. Seven examples of item characteristic curves. [Figure not
reproduced. Vertical axis: $P_g(\theta)$; horizontal axis: $\theta$.
Legible panel titles: (a) perfect scale curves; (b) latent distance curves.]



differ only by a translation along the $\theta$ scale. We say that items with
such characteristic curves vary only in their difficulty. With the
two-parameter logistic model, item characteristic curves vary in both
slope (some curves increase more rapidly than others, i.e., the
corresponding test items are more discriminating than others) and
translation along the ability scale (some items are more difficult
than others). Finally, with the three-parameter logistic model,
curves may differ in slope, translation, and lower asymptote. With the
one- and two-parameter logistic curves, the probabilities of correct
responses range from 0 to 1. In the three-parameter model, the lower
asymptote, in general, is greater than 0. When guessing is a factor in
test performance, this feature of the item characteristic curve can
improve the "fit" between the test data and the model. In other models
(the nominal response model and the graded response model) there are item
option characteristic curves. A curve depicting the probability of an
item option being selected as a function of ability is produced for each
option or choice in the test item. An example of this situation is shown
in Figure 1 (g).
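
The three logistic curves just described can be sketched with a single
function. The sketch assumes the usual logistic form, in which b locates
the curve on the ability scale, a sets its slope, and c is the lower
asymptote; the scaling constant D = 1.7 and all parameter values below are
assumptions made for illustration.

```python
import math


def logistic_icc(theta, b, a=1.0, c=0.0, D=1.7):
    """Probability of a correct response at ability `theta`.

    b: difficulty (translation), a: discrimination (slope),
    c: lower asymptote, D: scaling constant (an assumed convention).
    With a = 1 and c = 0 this is a one-parameter curve; with c = 0 a
    two-parameter curve; otherwise a three-parameter curve.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


abilities = [-3.0, -1.5, 0.0, 1.5, 3.0]

# One-parameter curves for an easy and a hard item: same shape, shifted.
easy = [logistic_icc(t, b=-1.0) for t in abilities]
hard = [logistic_icc(t, b=1.0) for t in abilities]

# Three-parameter curve: very low abilities still succeed about 20% of the time.
guessing = [logistic_icc(t, b=0.0, a=1.2, c=0.2) for t in abilities]

for t, e, h, g in zip(abilities, easy, hard, guessing):
    print(f"{t:5.1f}  {e:.3f}  {h:.3f}  {g:.3f}")
```

Because the two one-parameter curves share a slope, the easy item's curve
lies above the hard item's curve at every ability level, which is the
non-intersecting property noted above.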

It is most common for a user to specify the mathematical form of
the item characteristic curves before beginning his or her work. It is not
easy to check on the appropriateness of the choice because item
characteristic curves represent the regression of item scores on a variable
(ability) that is not directly measurable. About the only way the assumption
can be checked is to study the "validity" of the predictions made with the
item characteristic curves (Hambleton & Traub, 1973; Ross & Lumsden, 1968).
More will be said about how to make these predictions later in the paper.






The Ability Scale 

If we were to administer two tests that measured the same ability to
the same group of examinees, and one test was more difficult than the
other, we would obtain two different test score distributions. The
extent of the differences between the two distributions would depend,
among other things, on the difference between the difficulties of the two
tests. Unfortunately, there is no basis for preferring one distribution
over the other. What this example reveals is that, in general, the
test score distribution provides no information about the distribution
of ability scores.

The problem occurs because the raw-score units from each test are
unequal and different. On the other hand, the scale on which ability
scores are measured is one on which examinees will have the same ability
score across non-parallel tests measuring a common ability. Thus, even
though an examinee's test scores will vary across non-parallel forms of
a test measuring an ability, the expected ability estimate for an
examinee will be the same on each form.

Most measurement specialists are familiar with the concept of true
score, the expected test score for an examinee. What is the relationship
between true scores and ability scores? Lord and Novick (1968) showed
that the test characteristic curve, introduced earlier, provides the
relationship. This is easily seen from the following argument. Consider the
proportion-correct score $z = x/n$. Then

$$E(z \mid \theta) = \frac{1}{n} \sum_{g=1}^{n} P_g(\theta), \qquad [4]$$

$$\mathrm{Var}(z \mid \theta) = \frac{1}{n^2} \sum_{g=1}^{n} P_g(\theta)\, Q_g(\theta). \qquad [5]$$


$E(z \mid \theta)$ is the test characteristic curve (scaled by 1/n) introduced
earlier. It is the sum of item characteristic curves for items included in the
test. Suppose next we lengthen the test by adding an infinite number of
parallel forms. By definition, $E(z \mid \theta) = T$, the true score. Also,
$\mathrm{Var}(z \mid \theta) \to 0$ as $n \to \infty$, and so T and $\theta$
will be related by a monotonic increasing transformation, which is the test
characteristic curve. Clearly then, the two
concepts, T and $\theta$, are the same, except for the scale of measurement
used to describe each. One important difference is that true score is defined
on the interval [0, n] whereas ability scores are defined on the interval
$(-\infty, +\infty)$.
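
Equations [4] and [5] can be evaluated directly once a form for the item
characteristic curves is chosen. The sketch below assumes two-parameter
logistic curves purely for illustration; the function names and item
parameters are hypothetical.

```python
import math


def icc(theta, a, b):
    """Assumed two-parameter logistic item characteristic curve."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def expected_proportion_correct(theta, items):
    """Equation [4]: mean of the item characteristic curves at theta."""
    return sum(icc(theta, a, b) for a, b in items) / len(items)


def conditional_variance(theta, items):
    """Equation [5]: sum of P_g * Q_g at theta, divided by n squared."""
    n = len(items)
    return sum(icc(theta, a, b) * (1.0 - icc(theta, a, b)) for a, b in items) / n ** 2


# Hypothetical tests built from identical items (a = 1, b = 0).
short_test = [(1.0, 0.0)] * 4
long_test = [(1.0, 0.0)] * 400

print(expected_proportion_correct(0.0, short_test),
      conditional_variance(0.0, short_test))
print(expected_proportion_correct(0.0, long_test),
      conditional_variance(0.0, long_test))
```

Lengthening the test leaves the expected proportion correct unchanged but
shrinks its conditional variance, which is the limiting argument made above.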

There are other differences between true score and ability score.
True score is defined for a particular test. It is the expected
test score for an examinee. An examinee's true score will vary across
non-parallel measures of the same ability. On the other hand, ability
score is defined for a "pool" or "universe" of items measuring a single
ability. An examinee's true score in different samples of items would
(in general) vary. However, ability score is defined in terms of the
"pool" of items from which the sample was drawn. Latent trait models
specify relationships between examinee item performance and ability, and
so it is always possible to "transform" examinee performance on a
particular sample of items (defining a test) onto an ability scale defined for
the larger "pool" of test items. Thus, while an examinee would have (in
general) a different true score for each sample of items drawn from the
pool and would obtain different test scores in each sample of items, the
expected estimate of examinee ability from each sample of test items
would be the same.
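
The contrast between true score and ability can be sketched numerically.
The example assumes two-parameter logistic item curves and two invented
item samples, one easy and one hard. It works with true scores (expected
scores), not noisy observed scores, and inverts each test characteristic
curve by bisection; none of these choices come from the paper.

```python
import math


def icc(theta, a, b):
    """Assumed two-parameter logistic item characteristic curve."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def true_score(theta, items):
    """Expected number-correct score: the test characteristic curve."""
    return sum(icc(theta, a, b) for a, b in items)


def ability_from_true_score(score, items, lo=-10.0, hi=10.0, tol=1e-10):
    """Invert the (monotonic increasing) test characteristic curve by bisection."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if true_score(mid, items) < score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Two hypothetical samples of items, given as (a, b) pairs.
easy_form = [(1.0, -1.5), (1.2, -1.0), (0.8, -0.5)]
hard_form = [(1.0, 0.5), (1.2, 1.0), (0.8, 1.5)]

theta = 0.3
t_easy = true_score(theta, easy_form)
t_hard = true_score(theta, hard_form)
print(round(t_easy, 3), round(t_hard, 3))
print(round(ability_from_true_score(t_easy, easy_form), 6),
      round(ability_from_true_score(t_hard, hard_form), 6))
```

The two true scores differ because the forms differ in difficulty, yet both
map back to the same point on the ability scale.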


Ability scores can be used with item characteristic curve
parameters for items included in a test to estimate examinee test
performance. Recall,

$$E(X \mid \theta) = \sum_{g=1}^{n} P_g(\theta). \qquad [6]$$

Thus, ability scores provide a basis for content-referenced interpretations
of examinee test scores. When the quantities in Equation [6] are scaled
by 1/n, $E(X/n \mid \theta)$ represents the expected proportion of items in a
test that an examinee will answer correctly, and this interpretation will have
meaning regardless of the test performance of other examinees. Of course,
ability scores provide a basis for norm-referenced interpretations as well.

Let us consider next how the metric for the ability scale is chosen.
It is chosen so that the item characteristic curves have some specified
mathematical form. On the basis of examinee test performance, examinees
can be ordered on ability. The positions of these abilities on
the ability scale are chosen so as to maximize a criterion reflecting
agreement between examinee item response data and predictions of the
test data derived from the "best-fitting" item characteristic curves and
optimally positioned ability scores on the ability scale. However,
the origin and unit of measurement of the ability scale are arbitrary.
Any linear transformation of the ability scores is permissible. Also,
it has been suggested that when an external criterion measure with
meaningful units can be located, a transformation be found to
transform ability scores to this new scale. Such a transformation would
enhance the interpretability of ability scores.




Lord (1975d) reported one rather distressing property of the
ability scale observed in his work. Item parameters defined on the
ability scale were found to be correlated in six sets of empirical data
that he studied. Lord proposed a monotonic transformation of the ability
scale to correct the problem. With the availability of computer
programs, this operation could be routinely performed.




Latent Trait Models 

The purpose of this section is to introduce several of the most
commonly used latent trait models: the normal-ogive model, the one-, two-,
and three-parameter logistic test models, the graded-response model,
the nominal response model, and the continuous response model. All models
assume that the principle of local independence applies and (equivalently)
that the items in the test being fitted by a model measure a common ability.
A significant distinction among the models is in the mathematical form
taken by the item characteristic curves. A second important distinction
among the models is the scoring.

Additional latent trait models are discussed by Lazarsfeld and Henry
(1968), Lord and Novick (1968), and Torgerson (1958). Deterministic
models (for example, Guttman's perfect scale model) are of no interest to
us here because they are not likely to fit most achievement and aptitude
test data very well. Common test items rarely discriminate
well enough to be fit by a deterministic model (Lord, 1974b).

(a) Normal-Ogive Model

Lord (1952, 1953b) proposed a latent trait model (although he was not
the first psychometrician to do so) in which an item characteristic curve
takes the form of the normal ogive:

$$P_g(\theta) = \int_{-\infty}^{a_g(\theta - b_g)} \phi(t)\, dt, \quad (g = 1, 2, \ldots, n)$$    [7]

where $P_g(\theta)$ is the probability that an examinee with ability
$\theta$ answers item g correctly, $\phi(t)$ is the normal density function,
and $b_g$ and $a_g$ are parameters characterizing item g. The parameter
$b_g$ is usually referred to as the index of item difficulty: it represents
the point on the ability scale at which an examinee has a 50% probability
of answering the item correctly. The parameter $a_g$, called item
discrimination, is proportional to the slope of $P_g(\theta)$ at the point
$\theta = b_g$.

The item difficulty parameter, $b_g$, is defined on the same scale as
ability, $[-\infty, +\infty]$. In practice though, the range of $b_g$ is
from about -2 to +2 (assuming the ability distribution has been scaled to
be approximately on the range from -3 to +3). Values of $b_g$ near -2
correspond to items that are very easy and values of $b_g$ near +2
correspond to items that are very difficult for the group of examinees.

The item discrimination parameter, $a_g$, is defined, theoretically,
on the scale $[-\infty, +\infty]$. However, negatively discriminating items
are discarded from ability tests. Also, it is unusual to obtain $a_g$ values
larger than two. Hence, the usual range for item discrimination parameters
is [0, 2]. High values of $a_g$ result in item characteristic curves that
are very "steep." Low values of $a_g$ lead to item characteristic curves
that increase gradually as a function of ability.
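As a small illustration of the roles of difficulty and discrimination in the normal-ogive curve, the following Python sketch evaluates the standard normal distribution function at $a_g(\theta - b_g)$. The function name and the item parameter values are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math


def normal_ogive_icc(theta, a, b):
    """Normal-ogive item characteristic curve.

    Returns the standard normal CDF evaluated at a * (theta - b), i.e. the
    integral of the normal density from -infinity to a * (theta - b).
    """
    z = a * (theta - b)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hypothetical items: same difficulty, low versus high discrimination.
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    flat = normal_ogive_icc(theta, a=0.5, b=0.0)
    steep = normal_ogive_icc(theta, a=2.0, b=0.0)
    print(f"theta={theta:+.1f}  a=0.5: {flat:.3f}  a=2.0: {steep:.3f}")
```

At $\theta = b_g$ both curves pass through .50; away from that point the item with the larger discrimination value separates examinees more sharply.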

(b) Two-Parameter Logistic Model 

Birnbaum (1968) proposed a latent trait model in which the item
characteristic curve takes the form of a two-parameter logistic
distribution function,

$$P_g(\theta) = \frac{e^{D a_g(\theta - b_g)}}{1 + e^{D a_g(\theta - b_g)}}, \quad (g = 1, 2, \ldots, n).$$    [8]

Birnbaum substituted the two-parameter logistic cumulative distribution
function for the normal-ogive function as the form of the item
characteristic curve. This model has the important advantage of being
more mathematically tractable than the normal-ogive model. $P_g(\theta)$,
$b_g$, and $a_g$ have essentially the same interpretation as in the
normal-ogive model. The constant D is a scaling factor. It has been shown
that when D = 1.7, values of $P_g(\theta)$ for the normal-ogive and
two-parameter logistic models differ absolutely by less than .01 for all
values of $\theta$ (Haley, 1952).
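The closeness of the two curves when D = 1.7 can be checked numerically. The sketch below is only an illustrative check on a finite grid (the grid limits and spacing are arbitrary assumptions), not a proof of the bound cited from Haley (1952).

```python
import math

D = 1.7  # scaling factor discussed in the text


def normal_ogive(z):
    """Standard normal CDF at z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def scaled_logistic(z):
    """Logistic distribution function at D * z."""
    return 1.0 / (1.0 + math.exp(-D * z))


# Evaluate both curves on a fine grid of z = a * (theta - b) values.
grid = [i / 1000.0 for i in range(-6000, 6001)]
max_gap = max(abs(normal_ogive(z) - scaled_logistic(z)) for z in grid)
print(f"largest absolute difference on the grid: {max_gap:.4f}")
```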

Careful inspection of the two-parameter normal-ogive and logistic
test models reveals an additional implicit assumption that is
characteristic of most latent trait models: guessing does not occur. This
must be so since for all items with $a_g > 0$ (that is, items for which
there is a positive relationship between performance on the test item and
the ability measured by the test), the probability of a correct response to
the item decreases to zero as ability decreases.

(c) Three-Parameter Logistic Model

The three-parameter model can be obtained from the two-parameter
model by adding a third parameter, denoted $c_g$. The mathematical form of
the three-parameter logistic curve is written

$$P_g(\theta) = c_g + (1 - c_g)\,\frac{e^{D a_g(\theta - b_g)}}{1 + e^{D a_g(\theta - b_g)}}, \quad (g = 1, 2, \ldots, n).$$    [9]

The parameter $c_g$ is the lower asymptote of the item characteristic curve
and represents the probability of low ability examinees correctly answering
an item. The purpose of including a parameter $c_g$ in the model is to
attempt to account for the misfit of item characteristic curves at the low
end of the ability continuum, where, among other things, guessing is a
factor in test performance. It has been common to refer to the parameter
$c_g$ as the guessing parameter in the model.
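A brief Python sketch may help connect the three-parameter curve to the expected proportion-correct score obtained by averaging item characteristic curves over the items in a test. The function names and the three items below are hypothetical; setting c = 0 gives the two-parameter curve.

```python
import math

D = 1.7  # scaling factor


def icc_3pl(theta, a, b, c=0.0):
    """Three-parameter logistic curve: c + (1 - c) * logistic(D * a * (theta - b))."""
    logistic = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    return c + (1.0 - c) * logistic


def expected_proportion_correct(theta, items):
    """Average of the item characteristic curves over the items in a test."""
    return sum(icc_3pl(theta, a, b, c) for a, b, c in items) / len(items)


# Hypothetical (a, b, c) triples for a three-item test.
items = [(1.0, -1.0, 0.20), (0.8, 0.0, 0.20), (1.5, 1.0, 0.25)]

for theta in (-3.0, 0.0, 3.0):
    print(theta, round(expected_proportion_correct(theta, items), 3))
```

For very low abilities each curve approaches its lower asymptote, so the expected proportion correct approaches the mean of the c values rather than zero.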




It is perhaps surprising to note that the parameter $c_g$ typically
assumes values that are smaller than the value that would result if
examinees of low ability were to guess randomly on the item. As Lord (1974a)
has noted, this situation can probably be attributed to the ingenuity of
item writers in developing "attractive" but incorrect choices. For this
reason, avoidance of the label "guessing parameter" to describe the
parameter $c_g$ would seem to be desirable.

(d) One-Parameter Logistic Model (Rasch Model)

In the last decade, many researchers have become aware of work in the
area of latent trait models by Georg Rasch, a Danish mathematician (Rasch,
1966), both through his own publications and the papers of others advancing
his work (Anderson, Kearney, and Everett, 1968; Wright, 1968, 1977a, 1977b;
Wright and Panchapakesan, 1969). Although the Rasch model was developed
independently of other latent trait models and along quite different lines,
Rasch's model can be viewed as a latent trait model in which the item
characteristic curve is a one-parameter logistic function. Consequently,
Rasch's model is a special case of Birnbaum's two-parameter logistic model,
in which all items are assumed to have equal discriminating power and vary
only in terms of difficulty. The equation of the item characteristic curve
for this model can be written as

$$P_g(\theta) = \frac{e^{D\bar{a}(\theta - b_g)}}{1 + e^{D\bar{a}(\theta - b_g)}}, \quad (g = 1, 2, \ldots, n),$$    [10]

in which $\bar{a}$, the only term not previously defined, is the common
level of discrimination for all the items. Wright (1977a) prefers to write
the model with $D\bar{a}$ incorporated into the $\theta$ scale. Thus, the
right-hand side of the probability statement becomes
$e^{\theta - b_g}/(1 + e^{\theta - b_g})$.






The assumption that all item discrimination parameters are equal is
restrictive, and substantial evidence is available which suggests that
unless test items are specifically chosen to have this characteristic,
the assumption will be violated (e.g., Birnbaum, 1968; Hambleton & Traub,
1973; Lord, 1968; Ross, 1966).

While the Rasch model is a special case of the two- and three-
parameter logistic test models, it does have some special properties
that make it especially attractive to users. For one, since the model
involves fewer item parameters, it is easier to work with. Two, the
problem of parameter estimation is essentially solved. This point will be
discussed in a later section.

There appears to be some misunderstanding of the ability scale for
the Rasch model. Wright (1968) originally introduced the model this way:
The odds in favor of success on an item are given by the product of
an examinee's ability, $\theta_i^*$, and the reciprocal of the difficulty
of the item, $1/b_g^*$. Odds for success will be higher for brighter
students and/or easier items. The odds of success are defined as the ratio
of $P_{gi}$ to $1 - P_{gi}$, where $P_{gi}$ is the probability of success
by examinee i on item g. Therefore,

$$\frac{P_{gi}}{1 - P_{gi}} = \frac{\theta_i^*}{b_g^*}$$    [11]

or

$$P_{gi} = \frac{\theta_i^*}{\theta_i^* + b_g^*}.$$    [12]

Equation [10] can be obtained from Equation [12] by setting
$\theta_i^* = e^{D\bar{a}\theta_i}$ and $b_g^* = e^{D\bar{a}b_g}$. In
Equation [12], both $\theta_i^*$ and $b_g^*$ are defined on the interval
$[0, +\infty]$. If log ability and log difficulties are considered, then
$\theta$ and $b_g$, and $\log \theta^*$ and $\log b_g^*$, are measured on
the same scale, $[-\infty, +\infty]$, differing only by an expansion
transformation.

We return again to the point above regarding the odds for success on an
item. Clearly, there is an indeterminacy in the product of $\theta_i^*$ and
$1/b_g^*$. When odds for success are changed, we could attribute the change
to either $\theta_i^*$ or $1/b_g^*$. For example, if odds for success are
doubled, it could be because ability is doubled or because the item is half
as difficult. There are several ways to remedy the problem. For one, we
could choose a special set or "standard set" of test items, and scale the
$b_g^*$'s, g = 1, 2, ..., n, so that their average, $\bar{b}^*$, is one.
Alternately, we could do the same sort of scaling for a "standard" set of
examinees such that the average of $\theta_i^*$, i = 1, 2, ..., N, is set
to one. The final point is clear. When one item is twice as easy as
another, a person's odds for success on the easier item are twice what
they are on the harder item. If one person's ability is twice
as high as another person's ability, the first person's odds for success
are twice those of the second person (Wright, 1968). In what sense are
item and ability parameters measured on a ratio scale? An examinee with
twice the ability (as measured on the Rasch ability scale) of another
examinee has twice the odds of successfully answering a test item. Also,
when one item is twice as easy as another item (again, as measured on the
Rasch ability scale), a person has twice the odds of successfully
answering the easier one. The other latent trait models do not permit this
particular kind of interpretation of item and ability parameters.
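The ratio-scale interpretation can be illustrated with a few lines of Python, assuming the odds formulation in which the probability of success equals ability divided by the sum of ability and difficulty, both expressed on the positive (starred) scale. The numerical values are hypothetical.

```python
def rasch_probability(theta_star, b_star):
    """Probability of success from ratio-scale ability and difficulty."""
    return theta_star / (theta_star + b_star)


def odds(p):
    """Odds in favor of success, P / (1 - P)."""
    return p / (1.0 - p)


# Hypothetical examinee (ability 2.0) and item (difficulty 4.0).
base = odds(rasch_probability(2.0, 4.0))
double_ability = odds(rasch_probability(4.0, 4.0))   # ability doubled
half_difficulty = odds(rasch_probability(2.0, 2.0))  # item twice as easy

print(base, double_ability, half_difficulty)
```

Doubling ability and halving difficulty change the odds by the same factor of two, which is the indeterminacy described above: the data alone cannot say which of the two occurred.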

(e) Nominal Response Model

The one-, two-, and three-parameter logistic test models can only
be applied to test items which are scored dichotomously. The nominal
response model, introduced by Bock (1972) and Samejima (1972), is
applicable when items are multichotomously scored. The purpose of the model
is to maximize the precision of obtained ability estimates by utilizing the
information contained in each response (correct or incorrect) to an item.
This approach represents another method in the search for differential
scoring weights that improve the reliability and validity of mental test
scores (Wang and Stanley, 1970). Each item option is described by an
item option characteristic curve. Even the "omit" response can be
represented by a curve. For the correct response, the curve should be
monotonically increasing as a function of ability. For the incorrect
options, the shapes of the curves depend on how the options are perceived
by examinees at different ability levels.

There are, of course, many choices for the mathematical form of the
item option characteristic curves (Samejima, 1972). For one, Bock (1972)
assumed the probability that an examinee with ability level $\theta$ will
select a particular item option k (from m available options per item) to
item g is given by

$$P_{gk}(\theta) = \frac{e^{b_{gk}^* + a_{gk}^*\theta}}{\sum_{h=1}^{m} e^{b_{gh}^* + a_{gh}^*\theta}}, \quad (g = 1, 2, \ldots, n;\; k = 1, 2, \ldots, m).$$    [13]

For any ability level $\theta$, the sum of the probabilities of selecting
each of the m item options is equal to one. The quantities $b_{gk}^*$ and
$a_{gk}^*$ are item parameters related to the kth item option. When m = 2,
the items are dichotomously scored and the two-parameter logistic model and
the nominal response model are identical.
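The following Python sketch illustrates the option probabilities under one common reading of Bock's model, in which each option k contributes a term exp(b_k + a_k * theta) and the terms are normalized over the m options. That multinomial-logit form, the function name, and the parameter values are assumptions made for illustration.

```python
import math


def nominal_response_probs(theta, a, b):
    """Option probabilities proportional to exp(b_k + a_k * theta)."""
    z = [b_k + a_k * theta for a_k, b_k in zip(a, b)]
    z_max = max(z)  # subtract the maximum for numerical stability
    weights = [math.exp(v - z_max) for v in z]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical four-option item; the last option is the keyed response.
probs = nominal_response_probs(theta=0.5, a=[-0.8, -0.2, 0.1, 0.9],
                               b=[0.3, 0.0, -0.1, -0.2])
print([round(p, 3) for p in probs], round(sum(probs), 6))

# With m = 2 the second option follows a two-parameter logistic curve in
# theta with slope a_2 - a_1 and intercept b_2 - b_1.
two_option = nominal_response_probs(theta=0.7, a=[0.0, 1.2], b=[0.0, -0.5])
```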

(f) Graded Response Model

This model was introduced by Samejima (1969) to handle the testing
situation where item responses are made into two or more ordered categories.
For example, with test items like those on the Raven's Progressive Matrices,
one may desire to score examinees on the basis of the correctness (for
example, incorrect, partially correct, correct) of their answers. Samejima
(1969) assumed any response to an item g can be classified into $m_g + 1$
categories, scored $x_g = 0, 1, \ldots, m_g$, respectively. Samejima (1969)
introduced the operating characteristic of a graded response category. She
defines it as

$$P_{x_g}(\theta) = P_{x_g}^*(\theta) - P_{(x_g+1)}^*(\theta).$$    [14]

$P_{x_g}^*(\theta)$ is the regression of the binary item score on latent
ability, when all the response categories less than $x_g$ are scored 0 and
those equal to or greater than $x_g$ are scored 1. $P_{x_g}(\theta)$
represents the probability with which an examinee of ability level $\theta$
receives a score of $x_g$. The mathematical form of $P_{x_g}^*$ is
specified by the user. Samejima (1969) has considered both the
two-parameter logistic and two-parameter normal-ogive curves in her work.
In several applications of the graded response model, it has been common to
assume that discrimination parameters are equal for $P_{x_g}^*(\theta)$,
$x_g = 0, 1, \ldots, m_g$. This model is referred to as the homogeneous
case of the graded response model. Further, Samejima defines
$P_0^*(\theta)$ and $P_{(m_g+1)}^*(\theta)$ so that

$$P_0^*(\theta) = 1$$    [15]

and

$$P_{(m_g+1)}^*(\theta) = 0.$$    [16]

Also, for any response category $x_g$,

[equation illegible in the source]    [17]

The shape of $P_{x_g}(\theta)$, $x_g = 0, 1, \ldots, m_g$, will in general
be non-monotonic except when $x_g = m_g$ and $x_g = 0$. (This is true as
long as $P_{x_g}^*(\theta)$ is monotonically increasing, for all
$x_g = 0, 1, \ldots, m_g$.)
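A short Python sketch of the homogeneous case may make the construction concrete. It assumes that each category probability is the difference between adjacent boundary curves (the probability of scoring at or above a category minus the probability of scoring at or above the next one), with the boundary curves taken to be two-parameter logistic with a common discrimination. The threshold values are hypothetical.

```python
import math

D = 1.7  # scaling factor


def boundary_curve(theta, a, b):
    """Probability of scoring in a given category or higher (logistic form)."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def graded_category_probs(theta, a, thresholds):
    """Category probabilities for scores 0..m in the homogeneous case.

    thresholds holds increasing boundary locations b_1 < ... < b_m. The
    boundary curve for score 0 is fixed at 1 and the one beyond m at 0.
    """
    p_star = [1.0] + [boundary_curve(theta, a, b) for b in thresholds] + [0.0]
    return [p_star[x] - p_star[x + 1] for x in range(len(thresholds) + 1)]


# Hypothetical item scored 0 (incorrect), 1 (partially correct), 2 (correct).
for theta in (-2.0, 0.0, 2.0):
    probs = graded_category_probs(theta, a=1.0, thresholds=[-0.5, 0.8])
    print(theta, [round(p, 3) for p in probs])
```

In this example the probability of the middle score rises and then falls as ability increases, while the probabilities of the lowest and highest scores are monotonic.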

(g) Continuous Response Model

The continuous response model can be considered as a limiting case of
the graded response model. This model was introduced by Samejima (1973b)
to handle the situation where examinee item responses are marked on a
continuous scale. The model is likely to be useful, for example, to social
psychologists interested in studying attitudes.




Estimation of Parameters 
Once the various assumptions such as unidimensionality and local
independence have been made regarding the latent variable, and the form
of the item characteristic curve is specified, the problem of estimating
the parameters of the latent trait model arises. If, say, the
two-parameter normal-ogive or the logistic model is deemed appropriate, and
if n items are administered to N examinees, the parameters that have to be
estimated are the 2n item parameters pertaining to item difficulty
and item discrimination, and the N parameters that correspond to the
abilities of the examinees. Owing to the large number of parameters which
may result when a large number of examinees are involved, the estimation
of parameters in latent trait models presents substantial statistical and
numerical problems. The statistical problems that arise in the estimation
of parameters are related to the nature and properties of the estimates. The
numerical problems, on the other hand, arise in connection with the solution
of the estimation equations and are related to the convergence of the
algorithms employed to solve the equations.

The basic statistical problem associated with estimation of
parameters in latent trait models arises when the item parameters have to
be estimated simultaneously with the large number of ability parameters. In
this situation, the item parameters are common to all the N observations
and hence are called "structural parameters." The ability parameters,
called "incidental parameters," on the other hand, are specific to the
individual observations and hence increase with the number of observations.
The problem of estimating structural parameters in the presence of
incidental parameters has been studied by various authors. Neyman and Scott
(1948) and Kendall and Stuart (1973, p. 62) have shown that the maximum
likelihood estimates of the structural parameters in the presence of
incidental parameters are not consistent. More recently, Andersen (197^)
has demonstrated that consistent maximum likelihood estimates of the
structural or item parameters in a one-parameter latent trait model do not
exist when the ability parameters and the item parameters are estimated
simultaneously.

The estimation of the parameters of the latent trait models requires
the determination of the values of the parameters that maximize the
likelihood function if maximum likelihood estimates are sought. The
likelihood function, which will be defined a little later, is rather complex
and is a function of a large number of variables. The problem of finding
the extreme values of a function of several variables is not trivial and
often requires numerical methods. These numerical procedures are iterative
in nature, requiring some starting values for the parameters in question,
and these are then iterated upon until the sequence of values converges.
Often, the convergence of the sequence may be rather slow, or, if the
sequence does converge, it may not converge to the true solution. A case in
point is the three-parameter logistic model. Samejima (1973a) has shown
that the likelihood function for the estimation of ability parameters in a
three-parameter logistic model (under the assumption that the item
parameters are known) may not possess a unique maximum. In this case, since
a unique maximum does not exist, depending on the starting value, the
sequence may converge to a value that corresponds to a local maximum. Thus,
the values of the parameters obtained in this way, i.e., the estimates, may
not be the true maximum likelihood estimates of the parameters.
In the case of the three-parameter logistic model with known values for item
parameters, Samejima (197^) has provided conditions under which the
likelihood function possesses a unique maximum. However, when the item
parameters are not known and have to be estimated, the likelihood function,
which is a function of the item parameters as well as the ability
parameters, may not possess a unique maximum, and hence, the values of the
parameters that maximize the likelihood function locally may not correspond
to the true maximum likelihood estimates.

Despite the statistical and numerical problems mentioned above, the
literature in latent trait theory abounds with procedures for estimating the
parameters that arise in latent trait models. These estimation procedures,
which have been developed over the past thirty years, range from heuristic
procedures such as those given by Urry (1974) and Jensema (1976) to
conditional as well as unconditional maximum likelihood procedures
(Andersen, 1970, 1972, 1973a, 1973b; Bock, 1972; Lord, 1968, 1974b;
Samejima, 1969; Wright and Panchapakesan, 1969; Wright and Douglas, 1977)
and empirical as well as true Bayesian procedures (Birnbaum, 1969; Meredith
and Kearns, 1973; Owen, 1975). These procedures are discussed next, and
although there are severe problems with the estimation of parameters in
latent trait models, in some instances these problems can be overcome.

Maximum Likelihood Estimation in Latent Trait Models

We assume that an examinee is administered n dichotomously scored
items and that the underlying latent space is unidimensional. Let V be a
vector of binary random variables such that

$$V = [U_1\; U_2\; \ldots\; U_g\; \ldots\; U_n],$$

and v a particular realization of V such that

$$v = [u_1\; u_2\; \ldots\; u_g\; \ldots\; u_n].$$




The random variable $U_g$ takes on the value $u_g$, where $u_g = 1$ if the
examinee responds correctly to the item and $u_g = 0$ otherwise. We also
denote

$$P_g(\theta) = \text{Prob}[U_g = 1 \mid \theta]$$

and

$$Q_g(\theta) = 1 - P_g(\theta) = \text{Prob}[U_g = 0 \mid \theta].$$

Hence, the frequency distribution of the binary item score, for fixed
$\theta$, can be written as

$$f_g(u_g \mid \theta) = \text{Prob}[U_g = u_g \mid \theta] = \begin{cases} P_g(\theta) & \text{if } u_g = 1 \\ Q_g(\theta) & \text{if } u_g = 0. \end{cases}$$

Thus, the conditional probability of a response vector, V = v, for fixed
$\theta$ can be expressed as

$$\text{Prob}[V = v \mid \theta] = \text{Prob}[U_1 = u_1, U_2 = u_2, \ldots, U_n = u_n \mid \theta].$$

It then follows from the principle of local independence that

$$\text{Prob}[V = v \mid \theta] = \prod_{g=1}^{n} \text{Prob}[U_g = u_g \mid \theta] = \prod_{g=1}^{n} P_g(\theta)^{u_g}\, Q_g(\theta)^{1-u_g}.$$

If the n items are administered to a group of N examinees, then the
likelihood function or the joint probability distribution of the response
patterns for the N examinees for fixed ability levels
$\theta_1, \theta_2, \ldots, \theta_N$ is given by

$$L = \text{Prob}[V_1 = v_1, V_2 = v_2, \ldots, V_N = v_N \mid \theta] = \prod_{k=1}^{N} \prod_{g=1}^{n} P_g(\theta_k)^{u_{gk}}\, Q_g(\theta_k)^{1-u_{gk}}.$$    [18]
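The product over examinees and items is usually handled on the log scale. The Python sketch below evaluates the logarithm of such a likelihood for a small response matrix, using a one-parameter logistic curve with the scaling constant absorbed into the ability scale as the default item characteristic curve. The function names, the default curve, and the data are illustrative assumptions.

```python
import math


def rasch_p(theta, b):
    """One-parameter logistic curve with the scale factor absorbed into theta."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def log_likelihood(responses, thetas, bs, icc=rasch_p):
    """Log of the joint likelihood under local independence.

    responses[k][g] is the 0/1 score of examinee k on item g; thetas holds
    one ability per examinee and bs one difficulty per item.
    """
    total = 0.0
    for u_row, theta in zip(responses, thetas):
        for u, b in zip(u_row, bs):
            p = icc(theta, b)
            total += u * math.log(p) + (1 - u) * math.log(1.0 - p)
    return total


# Hypothetical data: three examinees by three items.
responses = [[1, 0, 0], [1, 1, 0], [1, 1, 1]]
bs = [-1.0, 0.0, 1.0]
print(log_likelihood(responses, [-0.5, 0.5, 1.5], bs))
```

Ability values that are ordered like the total scores give a larger log-likelihood for these data than the same values assigned in reverse order, which is the sense in which the likelihood ranks candidate parameter values.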

The function $P_g(\theta)$, the probability that an examinee with ability
$\theta$ responds correctly to item g, is the regression function of the
item response on $\theta$, and is more commonly referred to as the item
characteristic curve. The item characteristic curve is a function of the
item parameters, such as the indices of item difficulty and discrimination,
as well as the ability, $\theta_k$, of the kth examinee. Once the form of
the item characteristic curve is specified, the maximum likelihood estimates
of the item parameters and the ability parameters can be determined as those
values that maximize the likelihood function L given by Equation [18].

The item characteristic curve, $P_g(\theta)$, can take one of several forms
as mentioned earlier. We shall discuss the procedure for obtaining maximum
likelihood estimates only for the one-, two-, and three-parameter logistic
models as these are the models that are frequently used. The one-parameter
logistic model, better known as the Rasch model, has the item characteristic
curve $P_g(\theta)$ given by Equation [10]. In this case, $P_g(\theta)$ is a
function of the item difficulty parameter, $b_g$, and the ability parameter,
$\theta$, more appropriately denoted as $\theta_k$, corresponding to the
ability of the kth examinee. The two-parameter logistic model, for which the
item characteristic curve $P_g(\theta)$ is given by Equation [8], is a
function of $a_g$, the discriminating power of the item, $b_g$, the item
difficulty index, and $\theta_k$, the ability of the kth examinee.
Similarly, the item characteristic curve for the three-parameter logistic
model given by Equation [9], in addition to being a function of the
parameters $a_g$, $b_g$, and $\theta_k$, is a function of the parameter
$c_g$. In general, if we let $\Psi(x)$ denote the function

$$\Psi(x) = \exp x / (1 + \exp x),$$    [19]

then the item characteristic curve for the one-parameter model or the Rasch
model, $P_{1g}(\theta)$, is given by

$$P_{1g}(\theta) = \Psi[D\bar{a}(\theta - b_g)],$$    [20]

where $\bar{a}$ is the common level of discrimination. The item
characteristic curve for the two-parameter model is given by

$$P_{2g}(\theta) = \Psi[D a_g(\theta - b_g)],$$    [21]

while the item characteristic curve for the three-parameter model is given
by

$$P_{3g}(\theta) = c_g + (1 - c_g)\,\Psi[D a_g(\theta - b_g)].$$    [22]

In order to obtain the maximum likelihood estimates of the parameters
it is necessary to solve the likelihood equations

$$\partial \log L_1 / \partial b_g = 0, \quad \partial \log L_1 / \partial \theta_k = 0, \quad g = 1, \ldots, n;\; k = 1, \ldots, N,$$    [23]

for the Rasch model, the equations

$$\partial \log L_2 / \partial a_g = 0, \quad \partial \log L_2 / \partial b_g = 0, \quad \partial \log L_2 / \partial \theta_k = 0$$    [24]

for the two-parameter logistic model, and the equations

$$\partial \log L_3 / \partial a_g = 0, \quad \partial \log L_3 / \partial b_g = 0, \quad \partial \log L_3 / \partial \theta_k = 0, \quad \partial \log L_3 / \partial c_g = 0$$    [25]

for the three-parameter logistic model. The likelihood functions $L_1$,
$L_2$, and $L_3$ are obtained by substituting $P_{1g}(\theta)$,
$P_{2g}(\theta)$, and $P_{3g}(\theta)$, respectively, for $P_g(\theta)$ in
Equation [18]. Since the scale and origin are not fixed, we
have to solve simultaneously n+N-2 equations for the one-parameter model,
2n+N-2 equations for the two-parameter model, and 3n+N-2 equations for the
three-parameter model. The exact forms of these equations are not given here
since they are well documented (Birnbaum, 1968; Wright & Douglas, 1977, in
press).
Solutions to the likelihood equations discussed above are, unfortunately,
not available in closed form. Hence, numerical procedures have to be
employed to obtain the solutions of the likelihood equations. Procedures for
solving these equations have been suggested by various writers. Birnbaum
(1968, p. 422) suggests a heuristic procedure that involves specifying
starting values for the parameters, substituting these in the likelihood
equations and iterating until convergence takes place. Although this is
an appealingly simple procedure, it is inefficient and convergence to the
true solution is not guaranteed. A more satisfactory procedure is the
Newton-Raphson procedure suggested by Bock and Lieberman (1970), Bock
(1972), and Wright and Douglas (1977). In general, if the equations to be
solved are of the form $f(\alpha) = 0$, where $\alpha$ is a vector of
unknowns, and $f'(\alpha)$ is the matrix of the derivatives of $f(\alpha)$
with respect to the vector of parameters, then the (i+1)th approximation to
the solution of the system $f(\alpha) = 0$, $\alpha_{i+1}$, is given by

$$\alpha_{i+1} = \alpha_i - [f'(\alpha_i)]^{-1} f(\alpha_i),$$    [26]

where $\alpha_i$ is the ith approximation to the solution of
$f(\alpha) = 0$. Thus, the Newton-Raphson procedure, in this case, requires
the evaluation of the matrix of second derivatives, or the Hessian, of the
logarithm of the likelihood function with respect to the parameters.
Although the Newton-Raphson procedure is more tedious than the simpler
procedures, the convergence of the Newton-Raphson algorithm is quadratic,
at least in the neighborhood of the solution vector. In addition, the
Hessian evaluated at the maximum of the log-likelihood function yields the
inverse of the asymptotic dispersion matrix of the maximum likelihood
estimates (the expression for the asymptotic dispersion matrix is given in
a later section).
Alternatively, the Method of Scoring (Rao, 1965, p. 302) can be employed
to solve the likelihood equations. The Method of Scoring is essentially
the Newton-Raphson procedure, but employs the asymptotic dispersion matrix
in place of the inverse of the Hessian in the iterative sequence. Although
this procedure can be slow in convergence when compared to the Newton-
Raphson procedure, it is computationally simpler, since the asymptotic
dispersion matrix does not have to be updated at each iteration. In addition,
the asymptotic dispersion matrix is positive definite while the Hessian may
become indefinite at some stages of the iteration, a fact that causes
convergence problems in some instances.
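To make the Newton-Raphson iteration concrete, the sketch below applies it to the simplest sub-problem: estimating a single examinee's ability under the two-parameter logistic model when the item parameters are treated as known. This is a deliberately reduced illustration rather than the joint item-and-ability estimation discussed in the text; the function names, starting value, tolerance, and item parameters are all assumptions.

```python
import math

D = 1.7  # scaling factor


def p_2pl(theta, a, b):
    """Two-parameter logistic item characteristic curve."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def newton_raphson_ability(u, items, theta=0.0, tol=1e-8, max_iter=50):
    """Solve d log L / d theta = 0 for one examinee by Newton-Raphson.

    u holds 0/1 item scores; items holds known (a, b) pairs.
    """
    for _ in range(max_iter):
        grad = 0.0  # first derivative of the log-likelihood
        hess = 0.0  # second derivative (a 1 x 1 Hessian here)
        for u_g, (a, b) in zip(u, items):
            p = p_2pl(theta, a, b)
            grad += D * a * (u_g - p)
            hess -= (D * a) ** 2 * p * (1.0 - p)
        step = grad / hess
        theta -= step  # alpha_{i+1} = alpha_i - f(alpha_i) / f'(alpha_i)
        if abs(step) < tol:
            return theta
    raise RuntimeError("Newton-Raphson did not converge")


# Hypothetical three-item test and a mixed response pattern.
items = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]
u = [1, 1, 0]
theta_hat = newton_raphson_ability(u, items)
print(round(theta_hat, 4))
```

Because all three hypothetical items share the same discrimination, the likelihood equation reduces to the requirement that the sum of the item characteristic curves at the estimate equal the observed number-right score, which gives a simple check on the result.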




The maximum likelihood procedure discussed above has been employed
for the simultaneous estimation of item parameters and ability parameters by
such authors as Birnbaum (1968), Lord (1968), Wright and Panchapakesan
(1969), and Wright and Douglas (1977). An excellent discussion of some of
the estimation problems that are encountered in practice for the
three-parameter model is given by Lord (1968). He points out that the
iterative procedure fails to converge unless the number of items and the
number of examinees are large. If either n or N is small, the estimates of
the item discrimination indices may increase without bound. Wright (1977a)
has suggested a reason why this may happen by using an argument based on the
traditional method of estimating item discrimination. In order to estimate
the item parameters, estimates of the abilities of the examinees are first
obtained. Birnbaum (1968) has shown that a sufficient statistic for
estimating the ability $\theta_i$ for the ith examinee is given by
$\sum_j a_j u_{ij}$, where $u_{ij}$ is the score of the ith examinee on the
jth item. Since $a_g = \rho_g / \sqrt{1 - \rho_g^2}$ (assuming guessing is
minimal and ability is normally distributed), where $\rho_g$ is the
correlation between $\theta$ and $u_g$ (Lord and Novick, 1968, p. 378), an
initial value for $a_g$ can be obtained when $\rho_g$ is known and the
assumptions are met by the test data. The item scores are then weighted by
these values to yield an estimate for the ability of an examinee. This
procedure is iterated until a stable value of $a_g$ is obtained. However,
during the iteration, the $a_g$ that was the largest gets larger until it
dominates the weighted combination of item scores, and in the next step,
results in a $\rho_g$ that approaches unity, thus driving the value of
$a_g$ beyond bounds. Lord (1968) suggests imposing an upper limit for
$a_g$, based on the largest permissible correlation between the item and
the ability. This suggestion, which is not unlike that of incorporating
prior beliefs into the estimation procedure, produces estimates that are
reasonable.
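The runaway behavior follows directly from the relation between the discrimination index and the item-ability correlation: the ratio of the correlation to the square root of one minus its square grows without bound as the correlation approaches one. The short sketch below shows this, together with a cap on the correlation in the spirit of the upper limit described above. The function name and the cap value of .95 are hypothetical.

```python
import math


def discrimination_from_correlation(rho, rho_max=None):
    """a = rho / sqrt(1 - rho^2), optionally capping rho at rho_max first."""
    if rho_max is not None:
        rho = min(rho, rho_max)
    return rho / math.sqrt(1.0 - rho * rho)


for rho in (0.30, 0.60, 0.90, 0.99, 0.999):
    uncapped = discrimination_from_correlation(rho)
    capped = discrimination_from_correlation(rho, rho_max=0.95)
    print(f"rho={rho:.3f}  a={uncapped:8.3f}  a (rho capped at .95)={capped:.3f}")
```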

Similar problems arise when estimating the ability of the examinee.
Unlike the $a_g$, infinite values (positive or negative) are permissible
for $\theta_i$, and can be expected whenever an examinee obtains a perfect
score on the test or fails to score correctly even on one item. These
infinite values can be avoided by using a bounded function of $\theta_i$
instead of $\theta_i$ itself. However, as the occurrence of infinite values
for $\theta_i$ itself causes no theoretical problem, it is not necessary to
compensate for this.
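The reason a perfect score leads to an infinite ability estimate can be seen by evaluating the log-likelihood at increasing ability values: it keeps rising toward zero and never turns down, so no finite maximum exists. The sketch below uses a one-parameter logistic curve with hypothetical item difficulties, purely as an illustration.

```python
import math


def rasch_log_likelihood(theta, u, bs):
    """Log-likelihood of one response vector under a one-parameter logistic curve."""
    total = 0.0
    for u_g, b in zip(u, bs):
        p = 1.0 / (1.0 + math.exp(-(theta - b)))
        total += math.log(p) if u_g == 1 else math.log(1.0 - p)
    return total


bs = [-1.0, 0.0, 1.0]  # hypothetical item difficulties
perfect = [1, 1, 1]    # every item answered correctly
values = [rasch_log_likelihood(t, perfect, bs) for t in (0.0, 2.0, 4.0, 8.0)]
print([round(v, 4) for v in values])
```

The same argument applies in the other direction when no item is answered correctly.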

Another problem, noted by Lord (1968), is that the entire iterative
procedure may fail to converge or may converge extremely slowly. Lord (1968)
employed the Method of Scoring for the solution of the likelihood
equations. Although the procedure converges quadratically in the
neighborhood of the maximum, the convergence is rather slow when the
starting values given are far from the maximum. In some instances, poor
starting values cause the procedure to diverge. One possible solution,
obviously, is to provide good initial values, or to employ a linear search
procedure like the method of steepest ascent, and then switch over to the
Method of Scoring (or the Newton-Raphson) when the linear procedure slows
down in convergence. The linear search process does not seem to have been
incorporated in the existing algorithms and its efficacy needs to be
investigated further.

The major statistical problem that remains with the simultaneous
estimation of item parameters and the ability parameters is that these
maximum likelihood estimates do not enjoy the properties they are usually
accorded. Andersen (1973b) points out that maximum likelihood estimates
of the item parameters and the ability parameters, when estimated
simultaneously, are not consistent. This is true in general when structural
parameters are estimated in the presence of incidental parameters. Thus,
the procedure advocated by Wright and Panchapakesan (1969), Birnbaum (1968),
and Lord (1968) may not yield consistent estimates of the parameters. Since
the estimates may not be consistent, they may not even be asymptotically
unbiased.


Wright and Douglas (1977) provide a correction for the asymptotic
bias of the estimates of the parameters in the Rasch model. However, it
should be pointed out that this correction does not necessarily guarantee
the consistency of the estimates.

The likelihood function given by Equation [18] is, in the strict
sense, a likelihood function conditional on the item parameters and the
ability parameters, i.e.,

$$L \equiv L(u_{11}, u_{12}, \ldots, u_{Nn} \mid \boldsymbol{\tau}, \theta_1, \theta_2, \ldots, \theta_N),$$

where $\boldsymbol{\tau}$ is the vector of item parameters, $\theta_1, \ldots, \theta_N$ are the abilities
of the examinees, and $u_{ij}$ is the score of the ith examinee on the jth item.
As the sample size increases and approaches infinity, the number of ability
parameters, or incidental parameters, increases without bound. Instead of
becoming stable as the sample size increases, the maximum likelihood estimates
become ineffective; in fact, they are not even consistent (Neyman and Scott,
1948; Kendall and Stuart, 1973, p. 63; Andersen, 1973b). Thus, the
likelihood function, when expressed as a function conditional upon the item
and the ability parameters, does not yield estimators with desirable
properties. The problem can be overcome if it is possible to express the
conditional likelihood function in terms of only the item parameters. When
this is possible, the item parameters can be estimated without reference
to the ability parameters, and the estimates can be expected to have the
desirable properties that maximum likelihood estimates usually possess.
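The inconsistency that arises when incidental parameters multiply with the sample can be seen in the classic paired-observations illustration usually associated with Neyman and Scott (1948). The sketch below is not taken from the paper and is not a latent trait model: it assumes pairs of normal observations, each pair with its own unknown mean (the incidental parameter) and a common variance (the structural parameter). The function name and the range of the incidental means are hypothetical choices made only for this illustration.

```python
import random

def neyman_scott_variance_mle(n_pairs, sigma=1.0, seed=0):
    """Joint ML estimate of sigma^2 when every pair has its own unknown mean."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_pairs):
        mu = rng.uniform(-3.0, 3.0)      # incidental parameter for this pair (assumed range)
        x1 = rng.gauss(mu, sigma)
        x2 = rng.gauss(mu, sigma)
        pair_mean = (x1 + x2) / 2.0      # ML estimate of the incidental mean
        total += (x1 - pair_mean) ** 2 + (x2 - pair_mean) ** 2
    return total / (2 * n_pairs)         # ML estimate of the structural parameter

if __name__ == "__main__":
    for n in (100, 1000, 20000):
        print(n, round(neyman_scott_variance_mle(n), 3))
```

With a true variance of 1, the estimate settles near one half of the true value rather than the true value as the number of pairs grows, because each new pair brings a new mean to estimate.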

The likelihood function involving the item parameters can be expressed
independently of the ability parameters if a minimal sufficient statistic
$T_i$ for $\theta_i$ exists such that $T_i$ does not depend on the item parameters. Then,
the conditional maximum likelihood estimator of the item parameters, $\hat{\boldsymbol{\tau}}$,
is defined as the value of $\boldsymbol{\tau}$ that maximizes

$$L(u_{11}, u_{12}, \ldots, u_{Nn} \mid T_1, T_2, \ldots, T_N; \boldsymbol{\tau}).$$

Since, by definition of a minimal sufficient statistic, the likelihood
function conditional on $T_1, T_2, \ldots, T_N$ is independent of the ability parameters
$\theta_1, \theta_2, \ldots, \theta_N$, the vector of item parameters can be estimated without
any reference to $\theta_1, \theta_2, \ldots, \theta_N$. Andersen (1970) has shown that such
conditional maximum likelihood estimators are consistent and asymptotically
normally distributed. Conditional maximum likelihood estimators that are
consistent and that are asymptotically normally distributed have been
obtained for the Rasch model (Andersen, 1972, 1973a, 1973b). For the Rasch
model, $t_i = \sum_j u_{ij}$, the total score for individual i, is a sufficient
statistic for $\theta_i$ (Birnbaum, 1968, p. 429) and is independent of the item
parameters. Thus the conditional likelihood function is given by



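$$L(u_{11}, \ldots, u_{Nn} \mid t_1, t_2, \ldots, t_N; b_1, b_2, \ldots, b_n) = \exp\Bigl(-\sum_{i=1}^{N} \sum_{j=1}^{n} b_j u_{ij}\Bigr) \Big/ \prod_{i=1}^{N} \gamma(t_i; b_1, b_2, \ldots, b_n),$$

where

$$\gamma(t_i; b_1, b_2, \ldots, b_n) = \sum_{(t_i)} \exp\Bigl(-\sum_{j=1}^{n} u_{ij} b_j\Bigr).$$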

The summation $\sum_{(t_i)}$ is over all response vectors with $\sum_{j=1}^{n} u_{ij} = t_i$. The
conditional likelihood function given above is independent of the ability
parameters, $\theta_i$, and hence the item parameters $b_j$ can be estimated without
any reference to the ability parameters. The resulting likelihood equations
(Andersen, 1970) cannot be solved in closed form. A numerical procedure
for the solution of the likelihood equations, based on the Method
of Scoring, is given by Andersen (1972), and the reader is referred to this
paper for details of the procedure.
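To make the conditional likelihood concrete, the following sketch evaluates it for the Rasch model. It assumes the parameterization in which the probability of a correct response is exp(theta - b_j) / (1 + exp(theta - b_j)), and it computes the normalizing sums over response vectors with a standard recursion for elementary symmetric functions. The function names are hypothetical, and this is only an evaluation of the likelihood, not Andersen's Method of Scoring solution.

```python
import math
from itertools import product

def symmetric_functions(b):
    """gamma[t] = sum over 0/1 response vectors with total score t of exp(-sum_j u_j * b_j)."""
    gamma = [1.0] + [0.0] * len(b)
    for j, bj in enumerate(b):
        eps = math.exp(-bj)
        # Update from high to low score so that each item enters every term at most once.
        for t in range(j + 1, 0, -1):
            gamma[t] += eps * gamma[t - 1]
    return gamma

def conditional_log_likelihood(responses, b):
    """Log conditional likelihood of a 0/1 response matrix given each examinee's total score."""
    gamma = symmetric_functions(b)
    loglik = 0.0
    for u in responses:
        t = sum(u)
        loglik += -sum(bj * uj for bj, uj in zip(b, u)) - math.log(gamma[t])
    return loglik

if __name__ == "__main__":
    difficulties = [-0.5, 0.3, 1.2]          # illustrative values only
    data = [(1, 0, 0), (1, 1, 0), (0, 1, 0), (1, 1, 1)]
    print(conditional_log_likelihood(data, difficulties))
```

No ability value appears anywhere in the computation, which is the point of conditioning on the total scores. Maximizing this function over the difficulties would still require an iterative numerical routine.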




In the two-parameter model, Birnbaum (1968) has shown that a sufficient
statistic for $\theta_i$ is $\sum_j a_j u_{ij}$. However, this statistic is a function
of the unknown parameters $a_j$, and hence it is not possible to express the
conditional likelihood function as a function of only the item parameters.
It is possible, however, to express the likelihood function as a function
of the item parameters alone if the examinees can be viewed as a
random sample from a known population. If we denote the density function
of the ability parameter $\theta$ as $q(\theta)$, and the jth pattern of item responses
by the vector

$$\mathbf{v}_j = (u_{1j}, u_{2j}, \ldots, u_{nj}),$$

then

$$\pi_j = \text{Prob}[\mathbf{v}_j \mid \boldsymbol{\tau}] = \int_{-\infty}^{\infty} \prod_{i=1}^{n} P_i^{u_{ij}} Q_i^{1-u_{ij}} \, q(\theta) \, d\theta,$$

where $\boldsymbol{\tau}$ is the vector of item parameters (Lord and Novick, 1968, p. 362).
When the items are dichotomously scored, there are in all $2^n$ score patterns.
If N examinees are randomly sampled from the population, the number of
examinees with response pattern j is $r_j$, where $r_j = N p_j$ and $E(p_j) = \pi_j$.
Thus, the numbers of examinees with each response pattern are distributed
multinomially with parameters N and $\pi_j$, whence we obtain the likelihood
function conditional only on the item parameters as

$$L = N! \prod_{j=1}^{2^n} \pi_j^{r_j} \Big/ \prod_{j=1}^{2^n} r_j!\,.$$

On maximizing this likelihood function with respect to the item parameters,
we obtain the maximum likelihood estimates of the parameters.
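A minimal sketch of the pattern probabilities and the resulting multinomial log-likelihood follows. It assumes a two-parameter logistic item characteristic curve with scaling factor D = 1.7, a standard normal ability density, and a simple trapezoidal rule on a fixed grid; none of these numerical choices comes from the paper, and the function names are hypothetical.

```python
import math
from itertools import product

D = 1.7  # scaling factor assumed for the logistic model

def icc(theta, a, b):
    """Two-parameter logistic item characteristic curve."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def pattern_probability(u, a, b, n_points=201, lo=-6.0, hi=6.0):
    """Marginal probability of response pattern u, integrating over an assumed N(0, 1) ability."""
    step = (hi - lo) / (n_points - 1)
    total = 0.0
    for k in range(n_points):
        theta = lo + k * step
        density = math.exp(-0.5 * theta * theta) / math.sqrt(2.0 * math.pi)
        like = 1.0
        for uj, aj, bj in zip(u, a, b):
            p = icc(theta, aj, bj)
            like *= p if uj else 1.0 - p
        weight = 0.5 if k in (0, n_points - 1) else 1.0   # trapezoidal end points
        total += weight * like * density * step
    return total

def marginal_log_likelihood(counts, a, b):
    """Sum of r_j * log(pi_j); the factor N! / prod(r_j!) does not depend on the item parameters."""
    return sum(r * math.log(pattern_probability(u, a, b))
               for u, r in counts.items() if r > 0)

if __name__ == "__main__":
    a, b = [1.0, 0.8, 1.5], [-0.5, 0.0, 1.0]      # illustrative item parameters
    counts = {u: 0 for u in product((0, 1), repeat=3)}
    counts[(1, 0, 0)] = 12
    counts[(1, 1, 0)] = 20
    counts[(1, 1, 1)] = 8
    print(marginal_log_likelihood(counts, a, b))
```

The loop over all response patterns is what makes the approach expensive: with n items there are 2^n patterns, each requiring its own numerical integral.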

Bock and Lieberman (1970) and Bock (1972) have named the estimates
obtained by maximizing the likelihood function given above the
unconditional maximum likelihood estimates, since the likelihood function
is not conditioned on the ability parameters. The term "unconditional"
estimates has been used in a different sense by Wright and Douglas (1977)
and should not be confused with the usage of the term in this paper.
Wright and Douglas (1977) term their procedure for the simultaneous
estimation of item and ability parameters the unconditional procedure,
in contrast to the conditional procedure provided by Andersen (1972). In
the present usage, the estimates obtained by Wright and Douglas (1977)
are conditional, since the likelihood function employed by them is
conditional on the ability parameters.

Bock and Lieberman (1970) and Bock (1972) have outlined a procedure
for the unconditional maximum likelihood estimation of the parameters. The
procedure introduces a further complication to the already complex
estimation procedure. It is necessary to integrate the likelihood function with
respect to $\theta$. As this integral cannot be evaluated in closed form,
numerical integration procedures have to be employed. In addition,
the likelihood function requires the evaluation of $2^n$ response
patterns, a tedious task when a large number of items is involved. Finally,
the problem of specifying the density function of the latent variable $\theta$
has to be faced. Bock and Lieberman (1970) and Bock (1970) assumed that $\theta$
is distributed normally with zero mean and unit variance, an assumption that
may not be realistic.

Despite these problems, the unconditional procedure has theoretical
advantages over the conditional procedure in the two- and three-parameter
models. Kiefer and Wolfowitz (1956) have shown that in structural models,
if the incidental parameters are independently and identically distributed,
then the maximum likelihood estimates of the structural parameters are
consistent under regularity conditions. In the unconditional approach the
ability parameters are assumed to be independently and identically distributed,
and hence the unconditional maximum likelihood estimator can be expected to
be consistent. Thus, as Bock and Lieberman (1970) point out, the
unconditional procedure provides a standard to which other solutions can be
compared. A further justification is that when calibrating items, it is
not necessary to estimate the ability parameters and item parameters
simultaneously. Hence, a sample of individuals can be randomly selected
from a desired population and the item parameters estimated without any
reference to the ability parameters. The estimates of the item parameters,
since they have some of the optimal properties, can be treated as known
quantities when estimating the ability of a group of examinees to whom the items
are later administered. This procedure is particularly attractive since
the estimation of ability parameters, when item parameters are known, is
relatively straightforward, and the ability estimates, in this case, possess
the properties that are usually accorded to maximum likelihood estimates.

The estimation of ability parameters, when the item parameters are
known, has been discussed by various authors (Lord, 1974a; Samejima, 1969,
1972, 1973a). The likelihood function for estimating the ability $\theta_i$ of
the ith individual is given by

$$L(u_{i1}, u_{i2}, \ldots, u_{in} \mid \theta_i) = \prod_{j=1}^{n} P_{ij}^{u_{ij}} Q_{ij}^{1-u_{ij}}.$$

The maximum likelihood estimator of $\theta_i$ is sufficient and efficient (Birnbaum,
1968, p. 455-459). Lord (1974a) has shown further that the estimates are
consistent (Lord's proof of consistency, though valid for a different case,
can be adapted to the two-parameter case readily). In addition, the
likelihood function for the two-parameter logistic model possesses a unique
maximum. However, with respect to the three-parameter logistic model,
Samejima (1973a) has shown that the likelihood function may not have a
unique maximum if sample size is small and if the range of $\theta$ is unrestricted.
Samejima (1973a) goes on further to show that the problem can be solved by
considering the subdomain of the latent trait $\theta$ such that

$$\max_g (\theta_g^*) < \theta < \infty,$$

where

$$\theta_g^* = b_g + (\log c_g) / 2 D a_g.$$

On the subdomain, the likelihood function possesses a unique maximum, and
hence the maximum likelihood estimators of $\theta$ exist with their usual
properties.
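When the item parameters are treated as known, the ability estimate for the two-parameter logistic model can be found with a few Newton-Raphson steps. The sketch below assumes the logistic curve with D = 1.7; the step clamp, starting value, and function names are hypothetical choices for illustration rather than a procedure given in the paper.

```python
import math

D = 1.7  # scaling factor assumed for the logistic model

def icc(theta, a, b):
    """Two-parameter logistic item characteristic curve."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def estimate_ability(u, a, b, theta=0.0, tol=1e-8, max_iter=50):
    """Newton-Raphson ML estimate of theta for known 2PL item parameters."""
    if all(u) or not any(u):
        # A perfect or zero score has no finite maximum likelihood estimate.
        raise ValueError("no finite ML estimate for a perfect or zero score")
    for _ in range(max_iter):
        p = [icc(theta, aj, bj) for aj, bj in zip(a, b)]
        grad = D * sum(aj * (uj - pj) for aj, uj, pj in zip(a, u, p))
        info = D * D * sum(aj * aj * pj * (1.0 - pj) for aj, pj in zip(a, p))
        step = max(-1.0, min(1.0, grad / info))   # clamp the step for stability
        theta += step
        if abs(step) < tol:
            break
    return theta

if __name__ == "__main__":
    a = [1.0, 1.2, 0.8, 1.5]        # illustrative discriminations
    b = [-1.0, 0.0, 0.5, 1.0]       # illustrative difficulties
    print(estimate_ability([1, 1, 0, 0], a, b))
```

Because the two-parameter log-likelihood is concave in theta, the iteration heads toward the unique maximum from any reasonable starting value. The same routine applied to the three-parameter model would need the restriction on the range of theta discussed above.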

Properties of maximum likelihood estimators

Let $\hat{\boldsymbol{\tau}}$ be the maximum likelihood estimate of the vector $\boldsymbol{\tau}$ obtained
by maximizing the likelihood function L. Then, under general conditions
(not satisfied, as we have seen, by the maximum likelihood estimators when
item parameters and ability parameters are estimated simultaneously), the
maximum likelihood estimator, $\hat{\boldsymbol{\tau}}$, is asymptotically consistent, unbiased,
efficient, and a function of the sufficient statistic if a sufficient
statistic exists. In addition, $\hat{\boldsymbol{\tau}}$ is asymptotically multivariate
normally distributed with mean $\boldsymbol{\tau}$ and dispersion matrix
$-\{E(\partial^2 \log L / \partial \boldsymbol{\tau} \, \partial \boldsymbol{\tau}')\}^{-1}$. The expression $-E(\partial^2 \log L / \partial \boldsymbol{\tau} \, \partial \boldsymbol{\tau}')$
is commonly known as the information matrix (Kendall and Stuart, 1973,
p. 55), and is denoted by $I(\boldsymbol{\tau})$. As pointed out earlier, the information
matrix is the negative of the expected value of the Hessian of the log
likelihood function. The information function can be expressed in one
of several ways, i.e.,






The last form is particularly suitable for evaluating the information matrix
of complex likelihood functions.

The usefulness of the information matrix is evident. Since the
estimates are multivariate normally distributed asymptotically, the inverse
of the information matrix has along its diagonal the asymptotic variances
of the estimates. It is then possible to construct confidence intervals
for individual parameters and test hypotheses concerning the parameters
jointly or individually.

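The information function that is usually of interest is that of the
estimates of the ability parameters, $I(\theta_i)$. In particular, if $\hat{\theta}_i$ is the
maximum likelihood estimate of $\theta_i$, then

$$\hat{\theta}_i \sim N\{\theta_i, 1/I(\theta_i)\}.$$

The inverse of the asymptotic variance, $I(\theta_i)$, is given by

$$I(\theta_i) = -E(\partial^2 \log L / \partial \theta_i^2) = E(\partial \log L / \partial \theta_i)^2 = n E(\partial \log f(\theta_i) / \partial \theta_i)^2,$$

where

$$f(\theta_i) = P_{ij}(\theta_i)^{u_{ij}} \{1 - P_{ij}(\theta_i)\}^{(1 - u_{ij})}$$

and $P_{ij}(\theta)$ is the item characteristic curve.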

Expressions for the information function for the various models are
given by Birnbaum (1968, p. 460-462). Thus, it is possible to obtain an
estimate of the standard error associated with each ability estimate.
For details of an application, the reader is referred to Lord (1953a), where the
confidence interval for an examinee's ability, $\theta_i$, is constructed. We
shall return to a detailed discussion and use of the information function
in a later section.
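As an illustration of how the information function yields a standard error, the sketch below uses the familiar closed form for the two-parameter logistic model, in which the information at theta is the sum over items of D^2 a_j^2 P_j Q_j. The function names, the value D = 1.7, and the normal-theory interval with z = 1.96 are assumptions made for this example, not details drawn from the paper.

```python
import math

D = 1.7  # scaling factor assumed for the logistic model

def icc(theta, a, b):
    """Two-parameter logistic item characteristic curve."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def ability_information(theta, a, b):
    """Information at theta for the 2PL model: sum of D^2 * a_j^2 * P_j * Q_j over items."""
    total = 0.0
    for aj, bj in zip(a, b):
        p = icc(theta, aj, bj)
        total += D * D * aj * aj * p * (1.0 - p)
    return total

def confidence_interval(theta_hat, a, b, z=1.96):
    """Approximate interval: theta_hat plus or minus z times the asymptotic standard error."""
    se = 1.0 / math.sqrt(ability_information(theta_hat, a, b))
    return theta_hat - z * se, theta_hat + z * se

if __name__ == "__main__":
    a = [1.0, 1.2, 0.8, 1.5]        # illustrative discriminations
    b = [-1.0, 0.0, 0.5, 1.0]       # illustrative difficulties
    print(ability_information(0.3, a, b))
    print(confidence_interval(0.3, a, b))
```

The interval is only as good as the asymptotic approximation, so with very short tests it should be read as a rough guide.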



Heuristic estimation procedures

The maximum likelihood estimates, as pointed out in the previous
section, do have desirable properties, at least asymptotically. However,
these procedures are costly and time consuming in some situations. When
cost and time are of concern, heuristic estimation procedures (Urry, 1974;
Jensema, 1976) that provide rough and ready estimates of the item parameters
may be employed.

In the case of dichotomously scored items, under the assumption that
the ability is normally distributed with zero mean and unit variance, and
that the item characteristic curve is the two-parameter normal ogive, Lord and
Novick (1968, p. 377-378) have shown that the correlation $\rho_g$ between the score
on item g, $u_g$, and the underlying ability, $\theta$, is given by

$$\rho_g = a_g / (1 + a_g^2)^{1/2}.$$

They have also shown that the difficulty of item g for the group, $\pi_g$, is
given by

$$\pi_g = \Phi(-\gamma_g),$$

where $\gamma_g = b_g \rho_g$, and $\Phi(-\gamma_g)$ is the area under the unit normal curve
from $\gamma_g$ to infinity. Since $\rho_g$ is the correlation between the score on
item g and the latent ability $\theta$, $\rho_g$ is given as the factor loading of the
item on the common factor obtained by a factor analysis of the matrix of
sample tetrachoric correlations. The item difficulty, $\pi_g$, is estimated
by the proportion of examinees who answered item g correctly. Thus, once
$\rho_g$ and $\pi_g$ are determined, $a_g$ and $b_g$ can be obtained readily. Of course, the
appropriateness of these estimates will depend on the assumptions made in
the estimation procedure.
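The conversion from a factor loading and a proportion correct to item parameters can be sketched in a few lines. The code assumes the standard normal ogive relations, namely that the loading equals a_g / sqrt(1 + a_g^2) and that the proportion correct equals the unit normal area above gamma_g = b_g * rho_g. The function name is hypothetical, and the loading and proportion are taken as given rather than estimated from a tetrachoric correlation matrix.

```python
import math
from statistics import NormalDist

def heuristic_item_parameters(rho, pi):
    """Convert a factor loading rho and proportion correct pi to (a, b) for a normal ogive item.

    Assumes rho = a / sqrt(1 + a^2) and pi = Phi(-gamma) with gamma = b * rho.
    """
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between 0 and 1")
    if not 0.0 < pi < 1.0:
        raise ValueError("pi must lie strictly between 0 and 1")
    a = rho / math.sqrt(1.0 - rho * rho)
    gamma = -NormalDist().inv_cdf(pi)    # solves pi = Phi(-gamma)
    b = gamma / rho
    return a, b

if __name__ == "__main__":
    # Illustrative values: loading 0.6, 80 percent of examinees answering correctly.
    print(heuristic_item_parameters(0.6, 0.8))
```

An easy item (proportion correct above one half) yields a negative difficulty, as expected. Loadings close to 1 produce very large discrimination values, which is one reason such estimates should be treated as rough.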

A further parameter, the guessing parameter, $c_g$, has to be estimated
in the three-parameter latent trait model. Jensema (1976), following Lord
(1968), suggests obtaining the proportion of the examinees passing an item
at each of the lower item-excluded subtest scores, and using this as an
estimate of $c_g$. Once a value for $c_g$ is obtained, the method suggested
in the preceding paragraphs can be employed to estimate $a_g$ and $b_g$.

Although the above procedure is relatively simple to implement,
there are several problems with the procedure. First, the estimates $\hat{a}_g$, $\hat{b}_g$, and
$\hat{c}_g$ obtained by this method do not have any known sampling properties.
Secondly, factor analysis of a matrix of tetrachoric correlations presents
theoretical problems. The matrix of sample tetrachoric correlations is not
necessarily positive definite and hence cannot, in the strict sense,
be factor analyzed.

However, despite these problems, Jensema (1976) reports that the
correlations between these estimates and the maximum likelihood estimates
of the parameters are relatively high. Hence, these heuristic procedures
can be taken to provide quick and cost-saving estimates of the item
parameters when these issues are of major concern.




Bayesian estimation of parameters

When prior information (or belief) about a parameter is available,
it is conceivable that incorporation of this information in the estimation
procedure would increase the "accuracy" or the meaningfulness of the
estimates. An example of this was encountered earlier, where in order
to prevent the estimates of the item discrimination parameter from
drifting out of bounds, it was necessary to impose limits on the range of values
the parameter could take. Similarly, the distribution of ability, $q(\theta)$,
or the prior information about $\theta$, was incorporated into the unconditional
estimation procedure. Despite these efforts, relatively little is known
about the feasibility of applying Bayesian procedures for the estimation
of parameters in latent trait models.

It may be instructive to review the logic of the Bayesian estimation
procedure briefly (for a detailed account, the reader is referred to
Novick and Jackson, 1974). Let $\tau$ be a parameter of interest and $x_1$,
$x_2, \ldots, x_N$ denote N values of an observable random variable x whose probability
density function $f(x \mid \tau)$ depends upon the value of the parameter $\tau$.
Supposing further that the N observations are independent, the joint
probability of the observations, or the likelihood function, $L(x \mid \tau)$, is
given by

$$L(x \mid \tau) = \prod_{i=1}^{N} f(x_i \mid \tau).$$

If prior information or belief about the parameter $\tau$ can be expressed as
$g(\tau)$, where $g(\tau)$ is the probability density function of $\tau$, then the
posterior distribution of $\tau$ given the observations, $h(\tau \mid x_1, x_2, \ldots, x_N)$,
can be expressed as (Kendall and Stuart, 1973, p. 159)

$$h(\tau \mid x_1, x_2, \ldots, x_N) = k \, L(x \mid \tau) \, g(\tau),$$

where k is a constant of proportionality. The posterior distribution of $\tau$
is thus an expression of the investigator's revised belief about the
parameter once the data are obtained.

The procedure for obtaining a Bayes estimator employing prior belief
has been advocated by, among others, Lindley and Smith (1972) and Novick
and Jackson (1974), and has been applied to latent trait models by
Birnbaum (1969) and Owen (1975). This approach employs the "subjective"
notion of probability as opposed to the classical, or frequency, theory
of probability. A compromise between these two views of probability is
obtained by employing the empirical Bayes procedure, in which the prior
distribution of the parameter is "estimated" from the data. This procedure,
which yields empirical Bayes estimators, is exemplified by the works of
Lord (1971b) and Meredith and Kearns (1973).

Birnbaum (1969) obtained Bayes estimates for the ability parameters
in the one- and two-parameter logistic models under the assumption that
the item parameters are known. He chose, for mathematical tractability,
the prior probability density function of $\theta_i$ to be the logistic density
function, i.e.,

$$g(\theta_i) = e^{D\theta_i} / (1 + e^{D\theta_i})^2,$$

where D = 1.7 is a scaling factor. The likelihood function in this case
is given by

$$L(u_{i1}, \ldots, u_{in} \mid \theta_i) = \prod_{g=1}^{n} P_g(\theta_i)^{u_{ig}} Q_g(\theta_i)^{1 - u_{ig}},$$

where $P_g(\theta_i)$ is the item characteristic curve for the one- or two-parameter
logistic model. The posterior density function of $\theta_i$ is then given as

$$h(\theta_i \mid u_{i1}, u_{i2}, \ldots, u_{in}) \propto L(u_{i1}, \ldots, u_{in} \mid \theta_i) \, g(\theta_i).$$

The Bayes estimator of $\theta_i$, $\hat{\theta}_i$, is taken as the mean of the posterior
distribution, i.e.,

$$\hat{\theta}_i = \int \theta_i \, L(u_{i1}, \ldots, u_{in} \mid \theta_i) \, g(\theta_i) \, d\theta_i \Big/ \int L(u_{i1}, \ldots, u_{in} \mid \theta_i) \, g(\theta_i) \, d\theta_i.$$

For a discussion and further details of the procedure, the reader is
referred to Birnbaum (1969).
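The posterior mean described above can be approximated numerically when a closed form is not at hand. The sketch below assumes the two-parameter logistic curve with D = 1.7, a logistic prior proportional to exp(D theta) / (1 + exp(D theta))^2, and an evenly spaced grid over an assumed range of theta. It is a grid approximation written for illustration, with hypothetical function names, and is not Birnbaum's own derivation.

```python
import math

D = 1.7  # scaling factor assumed for the logistic model and prior

def icc(theta, a, b):
    """Two-parameter logistic item characteristic curve."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def logistic_prior(theta):
    """Logistic density kernel exp(D*theta) / (1 + exp(D*theta))^2, written to avoid overflow."""
    e = math.exp(-D * abs(theta))        # the kernel is symmetric in theta
    return e / (1.0 + e) ** 2

def posterior_mean(u, a, b, lo=-8.0, hi=8.0, n_points=801):
    """Grid approximation to the posterior mean of theta for known item parameters."""
    step = (hi - lo) / (n_points - 1)
    num = den = 0.0
    for k in range(n_points):
        theta = lo + k * step
        like = 1.0
        for uj, aj, bj in zip(u, a, b):
            p = icc(theta, aj, bj)
            like *= p if uj else 1.0 - p
        weight = like * logistic_prior(theta)   # unnormalized posterior density
        num += theta * weight
        den += weight
    return num / den                            # normalizing constants cancel

if __name__ == "__main__":
    a = [1.0, 1.2, 0.8]             # illustrative discriminations
    b = [-1.0, 0.0, 1.0]            # illustrative difficulties
    print(posterior_mean([1, 1, 0], a, b))
    print(posterior_mean([1, 1, 1], a, b))
```

Unlike the maximum likelihood estimate, the posterior mean stays finite for an examinee who answers every item correctly, because the prior pulls the estimate back toward zero.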

The procedure advocated by Birnbaum (1969) is not general enough to
permit the estimation of item parameters and ability parameters
simultaneously. In addition, there is no provision for incorporating available
information about the "hyperparameters" that specify the prior
distribution completely.

The procedure suggested by Lindley and Smith (1972) for the
estimation of parameters in the general linear model can be applied to estimate the
parameters in the latent trait models. The likelihood function of the
observations for fixed $\theta_i$ and item parameters $a_g$ and $b_g$ (for the
two-parameter model) is expressed as

$$L(u_{11}, u_{12}, \ldots, u_{Nn} \mid \theta_1, \theta_2, \ldots, \theta_N; a_1, a_2, \ldots, a_n; b_1, b_2, \ldots, b_n).$$

In order to obtain the posterior distribution of the parameters $\theta_i$, $a_g$, and
$b_g$, it is necessary to specify prior distributions. We assume that our
prior beliefs about a $\theta_i$ are no different than about any other $\theta_j$, i.e.,
the prior information is "exchangeable" (Lindley and Smith, 1972; Novick
and Jackson, 1974). This implies that the $\theta_i$ have the probability
structure of a random sample from some common distribution. Thus, we can
assume that $\theta_i \sim N(\mu_\theta, \phi_\theta)$. In turn, we assume that $\mu_\theta$ and $\phi_\theta$, the
mean and variance of the prior distribution, are independent a priori and
that the density function of $\mu_\theta$ is $f(\mu_\theta)$ and that of $\phi_\theta$ is $h(\phi_\theta)$.

The assumption that the prior information about $b_1, b_2, \ldots, b_n$ and
$a_1, a_2, \ldots, a_n$ is exchangeable may appear to be implausible. However,
it is not unreasonable to assume a distribution for the item difficulty
parameters. Birnbaum (1968, p. 466) considers the case where the item
difficulty parameters are distributed normally. Thus, we may assume that
$b_g \sim N(\mu_b, \phi_b)$, and in turn assume that $\mu_b$ and $\phi_b$ are independently
distributed with known prior distributions. Although it seems unreasonable to
assume that the $a_g$'s have the probability structure of a random sample
from a common distribution, we may assume that the prior information on
the $a_g$'s is identical, with density function $p(a_g)$. Thus, the posterior
distribution of $\theta_1, \theta_2, \ldots, \theta_N$, $b_1, b_2, \ldots, b_n$, $a_1, a_2, \ldots, a_n$,
$\mu_\theta$, $\mu_b$, $\phi_\theta$, $\phi_b$ given the observations is

$$P(\theta_1, \ldots, \theta_N, b_1, \ldots, b_n, a_1, \ldots, a_n, \mu_\theta, \phi_\theta, \mu_b, \phi_b \mid u_{11}, u_{12}, \ldots, u_{Nn})$$
$$\propto L(u_{11}, u_{12}, \ldots, u_{Nn} \mid \theta_1, \ldots, \theta_N, b_1, \ldots, b_n, a_1, \ldots, a_n)$$
$$\times \; \phi_\theta^{-N/2} \phi_b^{-n/2} \exp\Bigl\{-\tfrac{1}{2}\Bigl[\phi_\theta^{-1} \sum_{i=1}^{N} (\theta_i - \mu_\theta)^2 + \phi_b^{-1} \sum_{g=1}^{n} (b_g - \mu_b)^2\Bigr]\Bigr\} \Bigl\{\prod_{g=1}^{n} p(a_g)\Bigr\} f(\mu_\theta) f(\mu_b) h(\phi_\theta) h(\phi_b).$$

In this case $\mu_\theta$, $\phi_\theta$, $\mu_b$, and $\phi_b$ are nuisance parameters and could be
removed by integrating the posterior density function. The resulting
posterior density function is only a function of the parameters $\theta_1, \ldots, \theta_N$,
$b_1, \ldots, b_n$, $a_1, \ldots, a_n$. Joint modal estimates of these parameters
can then be obtained by differentiating the posterior density function,
setting these derivatives equal to zero, and solving the resulting "Lindley
Equations." Alternatively, the $\theta$'s could be estimated independently of
the $a_g$'s and the $b_g$'s by integrating with respect to these parameters.
The joint modal estimates of $\theta_1, \ldots, \theta_N$ can be obtained by solving the
resulting Lindley Equations, or alternatively, the marginal density function
of, say, $\theta_i$ can be obtained by integrating out the other ability parameters.
The mode or the mean of the marginal distribution of $\theta_i$ could then be taken
as the Bayes estimate of $\theta_i$. The same procedure may then be applied to
each of the remaining parameters.

The procedure outlined above requires specification of prior
distributions for the parameters $\theta$, b, and a, and also for the hyperparameters
$\mu_\theta$, $\mu_b$, $\phi_\theta$, and $\phi_b$. In the case of prior ignorance, we may take $f(\mu_\theta)$
and $f(\mu_b)$ to have uniform distributions. Prior ignorance on $\phi_\theta$ and $\phi_b$
implies that $h(\phi_\theta) \propto \phi_\theta^{-1}$, and similarly for $\phi_b$. At this point, specification of
$p(a_g)$, the prior distribution of $a_g$, is unclear, but it could be specified in
terms of a rapidly decaying exponential function.

Alternatively, an empirical Bayes procedure could be used to estimate
the parameters in the latent trait models. This approach requires the
specification of prior distributions for the parameters, but the
hyperparameters that specify the prior distributions are, in general, estimated
from the data. Meredith and Kearns (1973) have applied this procedure
to the Rasch model and obtained empirical Bayes estimates of the ability
parameter by expressing the likelihood function in terms of the sufficient
statistic, the total score of an examinee.

The Bayesian procedures discussed above are obviously more complex
than the estimation procedures discussed in the preceding sections. In
addition, Bayes procedures require specification of prior information on
the parameters of interest and thus involve a subjective view of probability
as opposed to the classical or frequency theory of probability. However,
when applicable, Bayesian procedures yield more satisfactory solutions in that
improper solutions do not usually occur. Moreover, it is well known that
the Bayes procedures, with the exception of the empirical Bayes procedure, are
in general admissible, in the sense that they minimize the expected loss
(Meredith and Kearns, 1973). Furthermore, Bayesian credibility intervals
may be more meaningful than conventional confidence intervals. In addition,
as demonstrated by Owen (1975) and Meredith and Kearns (1973), the Bayes
estimates converge in probability to the true value with increasing sample
size. Finally, the Bayes procedures have the potential of offering
solutions to the estimation problems in the latent trait models when
the sample size and the number of items are small. In these situations
prior beliefs assume importance, while with increasing sample size they
tend to lose their importance.

Despite these advantages, further investigation is necessary regarding
the Bayesian procedures. The estimation procedure based on the approach
of Lindley and Smith (1972), outlined earlier, has not yet been
implemented, and its usefulness has to be further documented. In particular,
little is known about the families of priors that are appropriate, especially
since natural conjugate priors are not available for the latent trait
models of interest. Finally, the effect of specifying poor priors on the
estimates has to be studied carefully. In conclusion, we note that while
Bayesian procedures hold promise for solving the estimation problems
in latent trait theory, considerable research is required before definitive
statements can be made regarding the efficacy of these procedures.






Estimation in the nominal response, graded
response, and continuous response models

As opposed to the dichotomous response model, in the nominal
response model it is assumed that each of N examinees responds to n
multiple-choice items, of which the jth item has $m_j$ response categories.
In this case, the probability that an examinee of ability $\theta$ will
respond to item j by choosing category k is given by

$$P_{jk}(\theta) = \exp[z_{jk}(\theta)] \Big/ \sum_{h=1}^{m_j} \exp[z_{jh}(\theta)].$$

Andersen (1972) chose $z_{jh}(\theta)$ to be of the form

$$z_{jh}(\theta) = b_{jh} + \theta.$$

Bock (1972), on the other hand, chose $z_{jh}(\theta)$ of the form

$$z_{jh}(\theta) = b_{jh} + a_{jh}\theta.$$

Since responses to the $m_j - 1$ categories fix the response to the $m_j$th
response category, we have the restrictions $\sum_h b_{jh} = 0$ for the one-parameter
model, and $\sum_h b_{jh} = 0$, $\sum_h a_{jh} = 0$ for the two-parameter logistic model.
The simplest way to incorporate this restriction is to take $b_{jm_j} = 0$ and
$a_{jm_j} = 0$, or alternatively, reparameterize the model as indicated by
Bock (1972).
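The category probabilities of a nominal response item can be computed directly from the multinomial logistic expression. The sketch below assumes a linear form for each category, z_h = b_h + a_h * theta, in the spirit of Bock (1972); the function name and the example values are hypothetical. Subtracting the largest z before exponentiating is a numerical safeguard and does not change the result.

```python
import math

def category_probabilities(theta, a, b):
    """Probabilities of the response categories of one item under z_h = b_h + a_h * theta."""
    z = [bh + ah * theta for ah, bh in zip(a, b)]
    z_max = max(z)                               # guard against overflow in exp
    weights = [math.exp(zh - z_max) for zh in z]
    total = sum(weights)
    return [w / total for w in weights]

if __name__ == "__main__":
    slopes = [-1.0, 0.2, 0.8]        # illustrative values that sum to zero
    intercepts = [0.5, 0.0, -0.5]    # illustrative values that sum to zero
    for theta in (-2.0, 0.0, 2.0):
        print(theta, [round(p, 3) for p in category_probabilities(theta, slopes, intercepts)])
```

Adding the same constant to every intercept (or to every slope) leaves the probabilities unchanged, which is why a restriction such as the sum-to-zero constraint is needed to identify the parameters.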

Andersen (1972) obtained conditional estimates for his
one-parameter nominal response model by maximizing the likelihood function
conditional on the sufficient statistic. As indicated earlier, these
maximum likelihood estimates of parameters are consistent. Bock (1972)
obtained unconditional estimates of the item parameters and also the
conditional estimates of the item as well as ability parameters in the
manner described in an earlier section.

The graded response model and its natural extension, the continuous
response model, were introduced and studied by Samejima (1969, 1972,
1973b, 1974). Although Samejima does not discuss the estimation of
parameters in detail for these models, she derives important results
concerning these estimates. She shows that, unlike in the dichotomous response
case, both the normal ogive and the logistic models yield sufficient
statistics for the ability parameter. In addition, she shows that the
amount of information increases by shifting from dichotomous scoring
to graded and continuous scoring. Hence, the graded and continuous
response models offer advantages over the nominal and dichotomous response
models in that the information available increases. Furthermore, the
problem of estimating ability parameters in the graded and continuous response
models appears to be solved. We may, however, expect difficulties when
estimating item parameters and ability parameters simultaneously. It
appears that these problems may be solved by employing the procedures
discussed earlier, but further research is needed to establish this.




Testing Assumptions and Goodness of Fit of Latent Trait Models 
Assumptions

How reasonable is the assumption of unidimensionality or (as
has been shown to be equivalent) the assumption of local independence?
Lumsden (1976) was particularly distressed that more researchers do not
attend to this assumption. Testing the assumption of unidimensionality
takes precedence over other goodness of fit tests of a latent trait model
since, if the assumption of unidimensionality is untenable, the results
of the other tests are more difficult to interpret. For example, if tests
of goodness of fit of the model indicate that a particular latent trait
model does not fit the data, and if unidimensionality was previously
established, then at least this potential explanation of the misfit
between the model and the data can be ruled out.

The simplest way to ascertain unidimensionality is to factor analyze
the matrix of inter-item correlations. Existence of a single factor would
imply unidimensionality. Lord (1968) reported that various researchers
have factor analyzed matrices of tetrachoric item intercorrelations
to determine if a set of test items measures more than a single factor. He
noted that the residuals after extracting one factor were often near the
size that one would expect from sampling fluctuations. For example,
Coffman (1966) extracted 11 factors for the SAT Verbal Test, but most
of the variance could be accounted for by the first factor. On the
other hand, Hambleton and Traub (1973) were less successful in locating
unifactor tests, but in the three aptitude tests that they studied, they
did find a "dominant" first factor.




Another assumption of latent trait models concerns the particular
choice of a mathematical form of the item characteristic curves
to describe the test data. Since latent traits are not directly
measurable, we find ourselves in a situation where it is quite
difficult to separate the inappropriateness of a particular choice of
mathematical form for item characteristic curves from violations of
other assumptions of the model. One solution offered by Lord (1970a) is
to compare item characteristic curves derived from a direct method
(where the mathematical form of item characteristic curves does not
have to be prespecified) with estimated item characteristic curves
of the form specified by the user. The "closeness" of the two sets
of item characteristic curves provides a basis for checking the
appropriateness of the assumption. (Incidentally, when Lord attempted
this comparison with SAT test data, he found close agreement between
a direct method of item characteristic curve estimation and
three-parameter logistic curves.)

A second possible test of the assumption is to check the
"accuracy" of various predictions with the estimated item characteristic
curves of specified form. Accurate predictions provide evidence of
the suitability of the model for the particular data set and, of
interest here, the assumption concerning the mathematical form of
item characteristic curves. Of course, if the predictions are not
good, pinpointing the problem could be difficult. Several researchers
(for example, Hambleton and Traub, 1973; Ross, 1966) have attempted
to study the appropriateness of different mathematical forms of item
characteristic curves by using them, in a comparative way, to predict
various test score characteristics. Hambleton and Traub (1973) obtained
item parameters for one- and two-parameter logistic curves with three
aptitude tests. Assuming a normal ability distribution and using test
characteristic curves obtained from both the one- and two-parameter
logistic curves, they were able to obtain predicted score distributions
for each of the three aptitude tests. A measure of goodness of
fit was used to compare actual test score distributions with predicted
test score distributions from each test model. The "relative"
appropriateness of the two mathematical forms of item characteristic curves was
studied by comparing the goodness of fit statistics. A likelihood ratio test for
comparing the "relative" appropriateness of two mathematical forms of
item characteristic curves will be discussed later in this section. In
all three cases, substantially improved predictions were obtained with
the two-parameter logistic curves. The Hambleton-Traub results also
suggest, not surprisingly, that the two-parameter logistic model will
provide the greatest improvements over the one-parameter logistic model
when applied to data from short tests where the variability of
discrimination parameters is substantial.

Goodness of Fit

Statistical tests of goodness of fit of the various latent trait
models have been given by several authors (Andersen, 1973; Bock,
1972; Mead, 1976; Wright, Mead, and Draba, 1976; Wright and Panchapakesan,
1969). The procedure advocated by Wright and Panchapakesan (1969)
for testing the fit of the Rasch model essentially involves examining
the quantity $f_{ij}$, where $f_{ij}$ represents the frequency of examinees at
the ith ability level answering the jth item correctly. Then, the
quantity $y_{ij}$, where

$$y_{ij} = \frac{f_{ij} - E(f_{ij})}{[\mathrm{Var}(f_{ij})]^{1/2}},$$

is distributed normally with zero mean and unit variance. The frequency
$f_{ij}$ has a binomial distribution with parameters $p_{ij}$, the probability of
a correct response, given by $\theta_i^*/(\theta_i^* + b_j^*)$ for the Rasch model, and
$r_i$, the number of examinees in the score group. Hence $E(f_{ij}) = r_i p_{ij}$
and $\mathrm{Var}(f_{ij}) = r_i p_{ij}(1 - p_{ij})$. Thus a measure of the goodness of fit,
$\chi^2$, of the model can be defined as

$$\chi^2 = \sum_{i=1}^{n-1} \sum_{j=1}^{n} y_{ij}^2 .$$

The quantity $\chi^2$ defined above has the $\chi^2$ distribution with degrees
of freedom $(n-1)(n-2)$, since the total number of observations in the
matrix $F = \{f_{ij}\}$ is $n(n-1)$, and the number of parameters estimated is
$2(n-1)$. Wright and Panchapakesan (1969) also defined a goodness of
fit measure for individual items as

$$\chi_j^2 = \sum_{i=1}^{n-1} y_{ij}^2 ,$$

where $\chi_j^2$ is distributed as $\chi^2$ with degrees of freedom $(n-2)$. This
general method of determining the goodness of fit of overall test
data can be extended to the two- and three-parameter latent trait
models. The reader is referred to Hambleton and Traub (1973) for
an example of a test of goodness of fit applied to two- and three-
parameter logistic models.
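
The computation just described can be sketched in a few lines of Python. This is a minimal illustration, not the authors' program. It assumes the Rasch probability is written in its logit form, which equals $\theta^*/(\theta^* + b^*)$ when $\theta^* = e^{\theta}$ and $b^* = e^{b}$. The function names, ability values, difficulties, and frequencies are all hypothetical.

```python
import math


def rasch_probability(theta, b):
    """Rasch probability of a correct response, logit form (assumed)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def fit_chi_square(f, r, thetas, bs):
    """Return (overall chi-square, list of per-item chi-squares).

    f[i][j] -- number in score group i answering item j correctly
    r[i]    -- number of examinees in score group i
    thetas  -- ability estimate for each score group
    bs      -- difficulty estimate for each item
    """
    overall = 0.0
    per_item = [0.0] * len(bs)
    for i, theta in enumerate(thetas):
        for j, b in enumerate(bs):
            p = rasch_probability(theta, b)
            expected = r[i] * p                # E(f_ij) = r_i p_ij
            variance = r[i] * p * (1.0 - p)    # Var(f_ij) = r_i p_ij (1 - p_ij)
            y = (f[i][j] - expected) / math.sqrt(variance)
            overall += y * y
            per_item[j] += y * y
    return overall, per_item


# Hypothetical data: n = 3 items, hence n - 1 = 2 score groups.
thetas = [-0.8, 0.8]
bs = [-0.5, 0.0, 0.5]
r = [120, 150]
f = [[70, 40, 25], [125, 100, 90]]

chi2, item_chi2 = fit_chi_square(f, r, thetas, bs)
df_overall = (len(bs) - 1) * (len(bs) - 2)   # (n-1)(n-2)
df_item = len(bs) - 2                        # (n-2)
print(chi2, df_overall, item_chi2, df_item)
```

The statistics would then be referred to a chi-square table with the degrees of freedom shown.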

There are several problems associated with the chi-square tests
of fit discussed above. The test has dubious validity when any
one of the $E(f_{ij})$ terms, i = 1, 2, ..., n-1; j = 1, 2, ..., n,
has a value less than one. This follows from the fact that when any
of the $E(f_{ij})$ terms are less than one, the deviates $y_{ij}$, i = 1, 2, ..., n-1;
j = 1, 2, ..., n, are not normally distributed, and a $\chi^2$ distribution
is obtained only by summing the squares of normal deviates. Another
problem encountered in using the $\chi^2$ test is that it is sensitive to
sample size. If enough observations are taken, the null hypothesis
that the model fits the data will always be rejected using the $\chi^2$
test. However, it should be pointed out that this is an inherent
weakness of all statistical tests.

Alternatively, Wright, Mead, and Draba (1976) and Mead (1976) have suggested
a method of testing fit for the one-parameter model which involves
conducting an analysis of variance on the variation remaining in
the data after removing the effect of the fitted model. This procedure
allows not only a determination of the general fit of the data
to the model but also enables the investigator to pinpoint guessing
as the major factor contributing to the misfit. The residuals
remaining in the data after removing the effect of the fitted model
are plotted against $(\theta_i - b_g)$. According to the model,
the plot should be represented by a horizontal line through the
origin. For guessing, the residuals follow the horizontal line until
the guessing becomes important. When this happens the residuals are
positive, since the person is doing better than expected, and in that
region have a negative trend. If practice or speed is involved, the
items which are affected display negative residuals with a negative
trend line over the entire range of ability. Bias for a particular
group may be detected by plotting the residuals separately for the
two groups. It is generally found that the residuals have a negative
trend for the unfavored group and a positive trend for the favored
group.

Mead (1976) concludes by saying "All of the disturbances considered
represent some form of multidimensionality; they would violate
any model that assumes unidimensionality. Since the effect of the
disturbances often appears as a change in the slope of the item
characteristic curve, any model which includes item discrimination as a
parameter would appear to fit the data".

When maximum likelihood estimates of the parameters are obtained,
likelihood ratio tests can be obtained for hypotheses of interest.
Likelihood ratio tests involve evaluating the ratio, $\lambda$, of the maximum
value of the likelihood function under the hypothesis of interest
to the maximum value of the likelihood function under the alternate
hypothesis. If the number of observations is large, $-2 \log \lambda$ is known
to have a chi-square distribution with degrees of freedom given by
the difference in the number of parameters estimated under the alternate
and null hypotheses. An advantage possessed by likelihood ratio
tests over the other tests discussed earlier is apparent. Employing
the likelihood ratio criterion, it is possible to assess the fit of
a particular latent trait model against an alternative.
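
As a small worked illustration of the criterion, the Python fragment below forms $-2 \log \lambda$ from two maximized log-likelihoods and counts the degrees of freedom. The function name and the numerical values are hypothetical and stand in for the output of whatever estimation program is used.

```python
def likelihood_ratio_statistic(loglik_null, loglik_alt, k_null, k_alt):
    """Return (-2 log lambda, degrees of freedom).

    loglik_null, loglik_alt -- maximized log-likelihoods under the
                               hypothesis of interest and the alternate
    k_null, k_alt           -- number of parameters estimated under each
    """
    statistic = -2.0 * (loglik_null - loglik_alt)
    degrees_of_freedom = k_alt - k_null
    return statistic, degrees_of_freedom


# Hypothetical example: 20 items, one item parameter each under the
# hypothesis of interest and two item parameters each under the alternate.
stat, df = likelihood_ratio_statistic(-5230.4, -5201.9, 20, 40)
print(stat, df)
```

The statistic is then compared with the tabled chi-square value for the computed degrees of freedom.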

Andersen (1973) and Bock and Liebermann (1970) have obtained
likelihood ratio tests for assessing the fit of the Rasch model and
the two-parameter normal ogive model, respectively. Andersen (1973)
obtains a conditional likelihood ratio test for the Rasch model based
on the within score group estimates and the overall estimates of item
difficulties. He shows further that -2 times the logarithm of this
ratio is distributed as $\chi^2$ with degrees of freedom $(n-1)(n-2)$.
Based on the work of Bock and Liebermann (1970), likelihood ratio
tests can be obtained for testing the fit of the two-parameter normal
ogive model. It should be pointed out that these authors have
obtained both conditional and unconditional estimates of the parameters.
For the likelihood ratio test, it would be more appropriate
if the unconditional model is used, since with this model ability
parameters are not estimated, and hence the likelihood ratio criterion
can be expected to have the chi-square distribution. This
procedure can be extended to compare the fits of one model against
another (Andersen, 1973).

The major problem with this approach is that the test criteria
are distributed as chi-square only asymptotically. When large samples
are used to accommodate this fact, the chi-square value may become
significant owing to the large sample size! Further investigation is
clearly needed in this area in order to resolve this dilemma.



Test and Item Information and Efficiency Curves 



The precision with which examinee ability can be estimated is of
considerable importance. When the maximum likelihood estimate of ability is
obtained, the precision of the ability estimate can be conveniently expressed
in terms of the information function, referred to here as the test
information curve. The standard error of maximum likelihood estimates is given
by the square root of the inverse of the information curve. Birnbaum (1968)
defined information as a quantity inversely proportional to the squared
length of the confidence interval around an examinee's ability. Thus,
when information at an ability level is high, we have narrow confidence
bands around our estimates. If information is low, we have wider
confidence bands. Because the test information curve is a function of ability,
it has been suggested that test information curves ought to replace the
use of classical reliability estimates and standard errors of measurement
in test score interpretations.

In mathematical terms, Birnbaum (1968) gives the information curve of a
given scoring formula by

$$I_y(\theta) = \frac{\left[\sum_{g=1}^{n} w_g P_g'\right]^2}{\sum_{g=1}^{n} w_g^2 P_g Q_g} . \qquad [27]$$

In the expression above $I_y(\theta)$ is the amount of information at ability
level $\theta$ provided by the scoring formula y, where

$$y = \sum_{g=1}^{n} w_g u_g . \qquad [28]$$

The variable $u_g$ takes on the value 1 or 0 depending on whether or not item
g is answered correctly; $P_g$ is the probability of a correct answer to item
g by an examinee with ability level $\theta$; $Q_g$ is equal to $1 - P_g$; $P_g'$ is the
slope of the item characteristic curve at ability level $\theta$; and the item
scoring weights are $w_g$, g = 1, 2, ..., n.

Birnbaum (1968) demonstrated that the maximum value of $I_y(\theta)$, referred
to as the test information curve, is given by

$$I(\theta) = \sum_{g=1}^{n} \frac{P_g'^2}{P_g Q_g} . \qquad [29]$$

The maximum value of the information curve of a given scoring formula is
obtained when the scoring weights, $w_g$, are given by

$$w_g = \frac{P_g'}{P_g Q_g} .$$
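
The relationship between Equations [27] and [29] can be checked numerically. The Python sketch below is illustrative only: it assumes the three-parameter logistic item characteristic curve with the scaling constant D set to 1.7, and the item parameters and function names are hypothetical.

```python
import math

D = 1.7  # scaling constant; the value 1.7 is an assumption of this sketch


def prob(theta, a, b, c):
    """Three-parameter logistic probability of a correct answer."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def slope(theta, a, b, c):
    """Slope P' of the item characteristic curve at theta."""
    p = prob(theta, a, b, c)
    return D * a * (p - c) * (1.0 - p) / (1.0 - c)


def score_information(theta, items, weights):
    """Equation [27]: information of the scoring formula y = sum of w_g u_g."""
    numerator = sum(w * slope(theta, *it) for w, it in zip(weights, items)) ** 2
    denominator = sum(w * w * prob(theta, *it) * (1.0 - prob(theta, *it))
                      for w, it in zip(weights, items))
    return numerator / denominator


def total_information(theta, items):
    """Equation [29]: sum over items of P'^2 / (P Q)."""
    return sum(slope(theta, *it) ** 2
               / (prob(theta, *it) * (1.0 - prob(theta, *it)))
               for it in items)


# Hypothetical (a, b, c) parameters for three items.
items = [(1.0, -1.0, 0.20), (1.5, 0.0, 0.15), (0.7, 1.0, 0.25)]
theta = 0.5

optimal = [slope(theta, *it) / (prob(theta, *it) * (1.0 - prob(theta, *it)))
           for it in items]
info_unit = score_information(theta, items, [1.0] * len(items))
info_opt = score_information(theta, items, optimal)
info_max = total_information(theta, items)
standard_error = 1.0 / math.sqrt(info_max)
print(info_unit, info_opt, info_max, standard_error)
```

With the optimal weights the two functions agree, and with unit weights the information of the scoring formula cannot exceed the test information.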

In order to obtain the test information curve for a particular
set of test items, and consequently minimize the widths of confidence bands
about examinee ability, it has been shown that the scoring weights for the one-,
two-, and three-parameter logistic test models should be chosen to be 1, $Da_g$,
and

$$\frac{D a_g (P_g - c_g)}{(1 - c_g) P_g} ,$$

respectively (Lord and Novick, 1968). (Test information
curves and the best scoring weights for several other latent trait models
are given by Samejima [1969, 1972].) It should be noticed that only for
the three-parameter model are the scoring weights a function of ability
level. The scoring system in the three-parameter model has the effect of
reducing the weight assigned to correct answers on items where the values of the
lower asymptote ($c_g$) of the item characteristic curves are large. It can be
seen that the weights for such items are smaller for low-ability examinees
than for either middle- or high-ability examinees. These weights reflect
the fact that low-ability examinees are most likely to be answering the
items by guessing. For high-ability examinees, the optimum scoring weights
of items approach the quantity $Da_g$ (g = 1, 2, ..., n).
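
The behavior of the three-parameter weight across the ability scale can be seen in a short numerical sketch. The Python fragment below is illustrative only; the value D = 1.7, the item parameters, and the function names are assumptions made for the example.

```python
import math

D = 1.7  # scaling constant; assumed value


def prob3(theta, a, b, c):
    """Three-parameter logistic probability of a correct answer."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def optimal_weight(theta, a, b, c):
    """Three-parameter scoring weight D a (P - c) / ((1 - c) P)."""
    p = prob3(theta, a, b, c)
    return D * a * (p - c) / ((1.0 - c) * p)


a, b, c = 1.2, 0.0, 0.25   # hypothetical item with a sizeable lower asymptote
weights = {theta: optimal_weight(theta, a, b, c) for theta in (-3.0, 0.0, 3.0)}
two_parameter_weight = optimal_weight(-3.0, a, b, 0.0)   # c = 0 gives D a

for theta in sorted(weights):
    print(theta, round(weights[theta], 4))
print("D * a =", D * a)
```

The weight rises with ability toward $Da_g$, and setting $c_g = 0$ recovers the two-parameter weight at every ability level.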

The quantity $P_g'^2/P_g Q_g$ in Equation [29] is the contribution of item g
to the information curve of the test. For this reason it is called the
item information curve.

Item information curves have an important role in determining the
accuracy with which ability is estimated at different levels of $\theta$. Each
item information curve depends on the slope of the particular item
characteristic curve and the conditional variance of item scores at each ability
level $\theta$. The steeper the slope of the item characteristic curve and the
smaller the conditional variance, the higher will be the item information
curve at that particular ability level. The height of the item information
curve at a particular ability level is a direct measure of the usefulness
of the item for precise ability estimation at that level.

Figure 2 shows item information curves for five verbal test items
and the test information curve for a test composed of these items.* The
logistic parameters of the five items are shown below:

    Item           b_g            a_g            c_g
    10             1.1            2.0            .05
    [illegible]    [illegible]    [illegible]    .20
    13            -0.1            1.6            .16
    30             2.4            [illegible]    .09
    47            -0.4            .4             .20

*We are grateful to Frederic Lord for allowing us to reproduce
this figure from Lord (1968).



[Figure 2 graphic not recoverable: item information curves, with legible
labels for items 10, 13, 30, and 47, and the test information curve,
plotted against ability.]

Figure 2. Information curves estimated for five items
          and a five-item test. The items are from the
          verbal section of the SAT. This figure is
          reproduced by permission from Lord (1968).



The information curve for the test composed of the items is obtained
by summing the ordinates of the five item information curves. This curve
reveals that the five-item test provides the most information for high
ability students. This means that abilities are more precisely
estimated at the high end of the ability continuum. The height of the
test information curve is misleading, though, because of its dependence on
the metric of the ability scale (Lord, 1975d). At a particular ability
level, $\theta$, the item information curve is given by $P'^2/PQ$. But the slope
of the item characteristic curve, $P'$, is a function of the ability scale.
If the ability scale is compressed in the region of $\theta$, $P'$ will increase;
and, when the scale is stretched in the region of $\theta$, $P'$ will decrease.
A good example of the effect that a monotonic transformation on the ability
scale has on the test information curve is seen in the work of Lord. The
effect is substantial.

From Equation [29] it is clear that items contribute independently
to the test information curve. Birnbaum (1968) has also shown that
with his three-parameter model, an item provides maximum information at an
ability level $\theta$, where

$$\theta = b_g + \frac{1}{D a_g} \log_e \left\{ .5 \left[ 1 + (1 + 8 c_g)^{1/2} \right] \right\} . \qquad [31]$$

If guessing is minimal, then $c_g = 0$ and $\theta = b_g$. When $c_g > 0$, the point
of maximum information is shifted to the right of the item difficulty
value, $b_g$.
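
Equation [31] can be compared with a direct numerical search for the peak of an item information curve. The Python sketch below is illustrative only; it assumes the three-parameter logistic model with D = 1.7, and the item parameters, grid, and function names are hypothetical.

```python
import math

D = 1.7  # scaling constant; assumed value


def item_information(theta, a, b, c):
    """Item information P'^2 / (P Q) for the three-parameter logistic model."""
    logistic = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    p = c + (1.0 - c) * logistic
    slope = D * a * (1.0 - c) * logistic * (1.0 - logistic)
    return slope ** 2 / (p * (1.0 - p))


def theta_of_max_information(a, b, c):
    """Ability level of maximum item information, Equation [31]."""
    return b + math.log(0.5 * (1.0 + math.sqrt(1.0 + 8.0 * c))) / (D * a)


a, b, c = 1.0, 0.0, 0.2    # hypothetical item
closed_form = theta_of_max_information(a, b, c)

# Brute-force check on a fine grid from b - 3 to b + 3.
grid = [b - 3.0 + k * 0.001 for k in range(6001)]
numeric = max(grid, key=lambda t: item_information(t, a, b, c))

print(closed_form, numeric)
```

The two values agree to within the grid spacing, and the peak lies to the right of $b_g$ whenever $c_g > 0$.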

If non-optimal weights are used with a particular logistic
test model, the information curve derived from Equation [27] will be
lower, at all ability levels, than one that would result from the use of
optimal weights. Birnbaum (1968) used the term efficiency to refer to
the information loss due to the use of less than optimal scoring weights.
Efficiency is studied by calculating the ratio of the values of the
information curve of a given scoring formula and the test information curve at each
ability level (Hambleton and Traub, 1971). Hambleton and Traub (1971)
found that when there was no guessing (i.e., $c_g = 0$, g = 1, 2, ..., n), the
efficiency of unit scoring weights was quite high (over 85%) for typical
levels of variation in the item discrimination parameters (.20 to 1.00,
which very roughly translates into a range of biserial correlations from
.20 to .70). When guessing is introduced, the situation changes
dramatically. For low ability examinees, the efficiency of unit scoring
dropped to the 50%-70% range. The authors concluded that when a test is
being used to estimate ability across a broad range of the ability scale
and when guessing is a factor in test performance, the scoring system of
the three-parameter logistic model is to be preferred. Unit scoring weights
lead to efficient estimates of ability when there is little or no guessing
and when the range of discrimination parameters is not too wide.

Lord (1968) investigated the efficiency of unit scoring weights on
the verbal section of the SAT. Under the assumption that the three-
parameter model was the correct test model to explain the data, he found
that the efficiency of unit scoring weights varied from 55% at the lowest
ability level to a maximum of 90% at the highest ability level. Using
unit scoring weights was equivalent to discarding about 45% of the test
items for the low-ability examinees.
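
The efficiency ratio discussed above can be computed directly from item parameters. The Python sketch below is an illustration under stated assumptions: it uses the three-parameter logistic model with D = 1.7, and the item parameters are hypothetical and are not those of the Hambleton and Traub or Lord studies.

```python
import math

D = 1.7  # scaling constant; assumed value


def p_and_slope(theta, a, b, c):
    """Return (P, P') for a three-parameter logistic item."""
    logistic = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    p = c + (1.0 - c) * logistic
    return p, D * a * (1.0 - c) * logistic * (1.0 - logistic)


def unit_weight_efficiency(theta, items):
    """Unit-weight score information divided by test information."""
    pairs = [p_and_slope(theta, *it) for it in items]
    unit_info = sum(s for _, s in pairs) ** 2 / sum(p * (1.0 - p) for p, _ in pairs)
    full_info = sum(s * s / (p * (1.0 - p)) for p, s in pairs)
    return unit_info / full_info


# Hypothetical items: discriminations vary, first without and then with guessing.
no_guessing = [(0.4, -1.0, 0.0), (0.7, 0.0, 0.0), (1.0, 1.0, 0.0)]
with_guessing = [(a, b, 0.2) for a, b, _ in no_guessing]

for theta in (-2.0, 0.0, 2.0):
    print(theta,
          round(unit_weight_efficiency(theta, no_guessing), 3),
          round(unit_weight_efficiency(theta, with_guessing), 3))
```

When all items share the same discrimination and there is no guessing, the ratio equals one, because unit weights are then proportional to the optimal weights.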



On other occasions, one may be interested in comparing the relative
efficiency with which two different tests measure the same ability at
various points on the ability scale. It may also be of interest to
know the relative efficiency of two different scoring methods at various
ability levels. This can be determined by calculating the ratio of the
two test information curves at each ability level. (It should
be mentioned that the notion of relative efficiency is an important one
in assessing the merits of a test for measuring an ability
continuum, but Lord [1974c, 1974d] has also produced a way of studying relative
efficiency without using the concepts of latent trait theory.)

It has been shown by Birnbaum (1968) that relative efficiency is
directly proportional to the test length of a "baseline" test. That is,
a relative efficiency of 1.5 at an ability level means that it would
take 1½ times as many items in the baseline test to yield the discriminating
power at that ability level provided by the other test under consideration.
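
The test-length interpretation can be illustrated with a short numerical sketch. The Python fragment below assumes the three-parameter logistic model with D = 1.7 and uses hypothetical item parameters; a "test" is simply a list of (a, b, c) triples.

```python
import math

D = 1.7  # scaling constant; assumed value


def information(theta, items):
    """Test information: sum over items of P'^2 / (P Q)."""
    total = 0.0
    for a, b, c in items:
        logistic = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
        p = c + (1.0 - c) * logistic
        slope = D * a * (1.0 - c) * logistic * (1.0 - logistic)
        total += slope ** 2 / (p * (1.0 - p))
    return total


def relative_efficiency(theta, other_test, baseline_test):
    """Information of other_test relative to baseline_test at theta."""
    return information(theta, other_test) / information(theta, baseline_test)


# A 15-item test of identical (hypothetical) items against a 10-item baseline.
item = (1.0, 0.0, 0.2)
baseline = [item] * 10
longer = [item] * 15
ratio_length = relative_efficiency(0.5, longer, baseline)

# An easy and a hard test of equal length (hypothetical, no guessing).
easy = [(1.0, -1.0, 0.0)] * 10
hard = [(1.0, 1.0, 0.0)] * 10
ratio_high = relative_efficiency(2.0, hard, easy)
ratio_low = relative_efficiency(-2.0, hard, easy)
print(ratio_length, ratio_high, ratio_low)
```

In the first comparison the ratio equals the ratio of test lengths at every ability level; in the second it depends on where along the ability scale the comparison is made.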



Applications of Latent Trait Models

In this section we will consider several applications of latent trait
models.

Differential Weighting of Response Alternatives 

It is a common belief among test developers that it ought to be
possible to construct alternatives for multiple-choice test items that
differ in their degree of correctness. An examinee's test score could
then be based on the degree of correctness of his or her response
alternative selections, instead of simply the number of correct answers, possibly
corrected for guessing. With few exceptions, the results of
differential weighting of response alternatives have been disappointing
(Wang & Stanley, 1970). Despite the intuitive beliefs of test developers,
weighting of response alternatives has no consistently positive effect on
the reliability and validity of the derived test scores. However, our
view is that using correlation coefficients to study the merits of any new
scoring system is less than ideal. This is because correlation coefficients
will not reveal any improvements in the estimation of ability at different
regions of the ability scale. A concern for the precision of measurement
at different ability levels is important. There is reason to believe that
the largest gains in precision of measurement to be derived from a scoring
system that incorporates scoring weights for the response alternatives
will occur with low ability examinees. High ability examinees make
relatively few errors and therefore make little use
of differentially weighted incorrect response alternatives. The problem
with using a group statistic like the correlation coefficient to reflect
the improvements of a new scoring system is that any gains at the low end
of the ability continuum will be "washed out" when combined with the lack
of gain in information at other places on the ability continuum. One way
of evaluating a test scoring method is in terms of the precision with
which it estimates an examinee's ability. The more precise the estimate,
the more information the test scoring method provides. Birnbaum's concept
of information introduced earlier provides a much better criterion than do
correlation coefficients for judging the merits of new scoring methods.

Thissen (1976) applied the nominal response model to a set of data
from Raven's Progressive Matrices Test (a test where the options to each
[...]).
The model provides a measure of the precision of ability estimation
(Birnbaum's "information") at each ability level. Thissen's results were
clear and impressive. The nominal response model produced substantial
improvements in the precision of ability estimation in the lower half of
the ability range. Gains in information ranged from 1/3 more to nearly
twice the information derived from 0-1 scoring with the logistic test
model. According to Bock (1972), most of the new information to be
derived from weighted response scoring comes from distinguishing
examinees who choose plausible or partly correct answers from those who
omit the items.

In a study of vocabulary test items with the nominal response
model, Bock (1972) found that below median ability, there was 1½ to 2
times more information derived from the nominal response model over the
usual 0-1 test scoring method. In terms of test length, the scoring
system associated with the nominal response model had, for about one-
half of this examinee population, produced improvements in precision of
ability estimation equal to the precision that could be obtained by a
binary-scored test 1½ to 2 times longer than the original one with the
new method of scoring. Also encouraging was that the "curve" for each
response alternative (estimated empirically) was psychologically
interpretable. The Thissen and Bock studies should encourage other researchers
to go back and reanalyze their data using the nominal response model and
the measure of "information" provided by the logistic latent trait
models. The Thissen and Bock studies indicate that there is "information"
that can be recovered from incorrect examinee responses to a set of test
items and provide interesting applications of test information curves to
compare different test scoring methods.

Criterion-Referenced Testing 

Latent trait models provide an excellent underpinning for a theory
and practice of criterion-referenced testing. Much has been written on
the topic of criterion-referenced testing, but the area is suffering
because of a great many disconnected contributions, confusion over many
basic problems such as test development and test score use, and the
existence of unique problems such as the establishment of cut-off scores
(Hambleton & Novick, 1973).

A criterion-referenced test is constructed by sampling items from
a well-defined domain of items measuring an instructional objective
(Millman, 1974). (Typically, a criterion-referenced test will include
sets of test items measuring several objectives. When
several objectives are measured, the steps described below are repeated
for each set of items measuring a single objective.)

One primary use of a criterion-referenced test is to obtain an
estimate of an examinee's level of mastery (or "ability") on an objective.
Thus, a straightforward application of one of the latent trait models
(the assumption of unidimensionality would not likely be a problem) could
produce examinee ability scores. Among the advantages of this application
would be that items could be sampled from an
item pool for each examinee, and all examinee ability estimates would be
on a common scale (Hambleton, 1977).

Since item parameters are invariant across groups of examinees,
it would be possible to construct criterion-referenced tests to
"discriminate" at different levels of the ability continuum. Thus, a test
developer might select an "easier" set of test items for a pretest than
a posttest, and still be able to measure "examinee growth" by estimating
examinee ability at each test occasion on the same ability scale. This
cannot be done with classical approaches to test development and test
score interpretation. If we had a good idea of the likely range of
ability scores for the examinees, test items would be selected so as to
maximize the test information in the region of ability for the examinees
being tested. The optimum selection of test items would contribute
substantially to the precision with which ability scores were estimated.
In the case of criterion-referenced tests, it is common to observe lower
test performance on a pretest than on a posttest; therefore, the test
constructor could select the easier test items from the domain of items
measuring an objective for the pretest, and more difficult items could be
selected for the posttest. This would enable the test constructor to
maximize the precision of measurement of each test in the region of
ability where the examinees would most likely be located. Of course, if
the assumption about the location of ability scores was not accurate,
gains in precision of measurement would not be obtained.

Hambleton (1977) conducted an extensive study of criterion-referenced
test designs in various testing situations and has reported substantial
gains in test efficiency when the proper test design for a particular
group of examinees is used.



Test Development

In this section we will attempt to describe a few of the areas of
test development to which latent trait theory has been applied and shown
to have decided advantages over standard test construction technology.
It can be anticipated that as more is discovered about the properties of
the latent trait models and as more psychometricians begin to use these
models in the test development process, greater insight into the process
will accrue.

Latent trait theory offers two advantages to the psychometrician
interested in developing tests: (1) invariant item parameters that
facilitate the test development process as well as make possible the
development of tests for a variety of applications, and (2) item characteristic
curves that provide valuable insights into how examinees perform on
specific test items.

The first step in the test development process is the determination
of test specifications. One of these specifications is the type of test
item to be employed. The worker using latent trait theory has two options
open to him/her. Either items can be developed to fit a specific test
model or a test model can be chosen to "fit" the derived test data. For
example, one may select the three-parameter logistic test model if the
items are of the multiple-choice type. However, if he/she felt strongly
that the test should be developed using the one-parameter logistic model,
he/she would include in the test specifications that the items be
constructed to minimize guessing and to have equal discriminating parameters.

After the test specification process is completed, the actual
construction of the items is generally the next step. In many instances
previously constructed items may exist that are appropriate for test usage.

Suppose that an appropriate pool of pretested items does exist.
If these items were characterized by classical test theory parameters
describing item difficulty and item discrimination, the usefulness of
the item statistics in test development would depend on the match between
the characteristics of the pretest sample of examinees and the population
of examinees in which the test will be used. Another shortcoming is that
the size of item discrimination indices depends both on the number and the
particular items included in the pretest. When items are placed in
a test which has test items different from those in the pretest, the
usefulness of the discrimination indices is unknown. Because of the
invariant properties of the latent trait item parameters, this problem
is circumvented.

How does one select items from an existing item pool in order to
construct a test that meets a set of previously determined
specifications? If standard test development technology is employed,
there are a series of calculations that can be carried out to predict the
mean, standard deviation, and test reliability (Lord & Novick, 1968).
Input data for the calculations are the pretest item statistics.

Lord (1977b) outlined a method for predicting the mean, squared
standard error of measurement, and the test reliability based on any set
of items characterized by latent trait theory parameters. The procedure
involves specifying the ability levels of the group for which the test is
intended. The following expressions can then be used to determine the test
statistics of interest:

(1) $\mu_X = \frac{1}{N} \sum_{a=1}^{N} \sum_{g=1}^{n} P_g(\theta_a)$

(2) $\sigma_e^2 = \frac{1}{N} \sum_{a=1}^{N} \sum_{g=1}^{n} P_g(\theta_a) Q_g(\theta_a)$

(3) $\rho_{XX'} = 1 - \sigma_e^2 / \sigma_X^2$

The predicted score variance ($\sigma_X^2$) can be computed from the
predicted test score distribution.

Lord (1977b) concluded by saying that "...if we have a pool of
pretested items all measuring the same trait or ability, we can predict
the mean, variance, reliability and raw-score frequency distribution of
any test constructed from these items once we know the ability levels in
the group to be tested." When the shape of the ability distribution
for a population of examinees can be specified, Lord (1953a) has
shown how to use latent trait parameters to select items so as to
produce desired test score distributions.



To summarize, when a psychometrician is selecting items characterized
by classical test theory parameters, he/she is
forced to use a heuristic process that depends a great deal on the
ability to estimate, from previous experience, the average item-test
correlation and also on the similarity of the pretest group and the group
of interest. When using latent trait theory, only knowledge of the
ability distribution of the group of examinees of interest is necessary to
make accurate predictions of the test statistics.

Test information curves may also be used as a means of selecting
items from a previously established pool of items characterized by latent
trait theory parameters. The useful feature is that the contribution of
each item to the test information curve can be determined without knowledge
of the other items in the test. In conventional testing technology, the
situation is very different. The contribution of any item to such
statistics as test reliability cannot be determined independently of the
characteristics of all the other items in the test.



Lord (1977b) discussed Birnbaum's (1968) procedure for building a new
test. This procedure operates on a pool of calibrated items (so that an item
information curve is available for each item). The procedure outlined by Lord
is as follows:

1. Decide on the shape of the desired test information curve.
   Lord (1977b) calls this the target information curve.

2. Select items with item information curves that will fill up
   the hard-to-fill areas under the target information curve.

3. After each item is added to the test, calculate the test
   information curve for the selected test items.

4. Continue selecting test items until the test information
   curve approximates the target information curve to a
   satisfactory degree.
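
The four steps above can be mechanized in many ways. The Python sketch below shows one simple possibility, a greedy rule that at each step adds the item that most reduces the remaining shortfall under the target curve. This rule is an assumption of the sketch and is not taken from Lord or Birnbaum. The ability grid, item pool, target values, D = 1.7, and all function names are hypothetical.

```python
import math

D = 1.7                                   # scaling constant; assumed value
GRID = [-2.0, -1.0, 0.0, 1.0, 2.0]        # ability levels at which curves are compared


def item_information(theta, a, b, c):
    """Item information P'^2 / (P Q) for the three-parameter logistic model."""
    logistic = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    p = c + (1.0 - c) * logistic
    slope = D * a * (1.0 - c) * logistic * (1.0 - logistic)
    return slope ** 2 / (p * (1.0 - p))


def shortfall(current, target):
    """Total amount by which the current curve lies below the target."""
    return sum(max(0.0, tgt - cur) for cur, tgt in zip(current, target))


def build_test(pool, target, tolerance=0.0):
    """Greedily select items until the target is met or the pool is empty."""
    selected = []
    remaining = list(pool)
    current = [0.0] * len(GRID)           # test information curve so far (step 3)
    while remaining and shortfall(current, target) > tolerance:
        def shortfall_after(item):
            trial = [cur + item_information(th, *item)
                     for cur, th in zip(current, GRID)]
            return shortfall(trial, target)
        best = min(remaining, key=shortfall_after)        # step 2
        remaining.remove(best)
        selected.append(best)
        current = [cur + item_information(th, *best)
                   for cur, th in zip(current, GRID)]
    return selected, current                              # step 4 ends the loop


# Step 1: a hypothetical target information curve, peaked at theta = 0.
target = [0.5, 1.0, 1.5, 1.0, 0.5]
pool = [(a, b, 0.2) for b in (-2.0, -1.0, 0.0, 1.0, 2.0) for a in (0.8, 1.2)]

selected, achieved = build_test(pool, target)
print(len(selected), [round(x, 2) for x in achieved])
```

Because item information curves add, each candidate can be evaluated without reference to the items not yet chosen, which is what makes a step-by-step rule of this kind workable.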



It is obvious that the use of item information curves in the manner
described above will allow the test developer to produce a test that will
very precisely fulfill any set of desired test specifications.

Latent trait models not only allow the test developer to examine
the contribution of individual items to a test information curve, but
they also allow for the comparison of test information curves. It is
possible for a psychometrician to form different combinations of items
(tentative tests) in the initial stages of test development and compare
the information curves of different sets of items at specific ability
levels, thus allowing him/her to choose the set of items most suited
for the purpose of the test. Marco (1977) used this technique to study
the effect of lowering the difficulty of the Scholastic Aptitude Test.
Item parameters ($b_g$, $c_g$, $a_g$) were determined on item data from about 3,000
students who took the mathematical part of the SAT in December 1970 and
the verbal part of the SAT in January 1971. He then selected items to
form four tests:

1. a test composed mostly of moderately difficult and easy items;

2. a middle difficulty test having a bimodal distribution of item
   difficulties;

3. a middle difficulty test with no easy or difficult items; and

4. a very easy test composed of all easy items.

Examination of the four test information curves showed clearly that if the
test was made easier, discrimination in the upper or middle part of the
ability range suffered.

To summarize, the test information curves obtained from tests developed
using latent trait models make it possible to obtain some indication
of the probable results of combining various subsets of items and also
allow for comparisons among these subsets. No such feature exists for
tests developed by conventional test construction methods. Thus, much of the
combining of test items or altering of existing tests with conventional
procedures must be done on an intuitive basis.



Once the final test forms are assembled, the next step in the test
development process is usually test norming. In conventional testing
technology, this is an expensive and time-consuming process involving testing
large samples of examinees similar to the population for which the test is
intended. Because latent trait models provide ability estimates that are
independent of the items selected for administration, the norming process
can be simplified considerably. It is not necessary for all individuals
to take all of the test items. The test can be broken up into subtests
with different groups of students taking different subsets of items. A
successful application of this type of norming was made to the Key Math
Diagnostic Arithmetic Test published by American Guidance Service.



Tailored Testing

Since the beginning of formal testing some 60 years ago, almost all
testing has been done in a conventional fashion; that is, a group of
individuals all take the same test. Since these individuals will vary in
terms of the ability that is being measured by the test, some will find the
test too difficult and others too easy. Those who find the test too
difficult may experience frustration and negative reactions, while those who
find the test too easy will not be sufficiently motivated to put forth
maximum effort. In short, the test will do a good job of measuring for
those individuals whose ability is at or near the median ability of the
group. For such individuals, the difficulty level will be such that they
will answer half the questions correctly and half incorrectly. A logical
extension of this line of reasoning dictates that the test would measure
maximally the ability of all individuals in the group if it presented
questions to each individual that that individual could answer correctly
half the time. This, of course, is not possible using one test.

In tailored testing, an attempt is made to "tailor" the difficulties
of the test items to the ability of the examinee being measured. This
demands the existence of a large pool of items whose statistical
characteristics are known so that suitable items may be drawn. The procedure
does not lend itself easily to paper and pencil testing situations, and hence
the tailoring process is typically done by computer (exceptions to this rule
are presented in the work of Lord [1971c, 1971d]). According to Lord (1974b), a
computer must be programmed to do the following in order to tailor a test to an
examinee:

1. Predict from the examinee's previous responses how the examinee
would respond to various test items not yet administered.

2. Make effective use of this knowledge in picking the test item
to be administered next.

3. Assign at the end of testing a numerical score that somehow
represents the ability of the examinee tested.

Tailoring a test to examinees will circumvent the psychological
problems mentioned earlier. Also, from a psychometric point of view, tailored
testing can insure that the standard error of measurement will be the same
throughout the ability continuum. This is not true of conventional tests,
where the standard error tends to enlarge for individuals at the extremes
of the ability continuum.

(a) Classical Testing Theory and Tailored Testing

Early work on tailored testing, making use of classical test theory,
tended to focus on concerns somewhat removed from the notion of ability
estimation for an individual. Because of this different focus, classical
methods functioned adequately. These studies (for example, Cleary, Linn, &
Rock, 1968; Linn, Rock, & Cleary, 1972) focused on two areas: allocation
of examinees to extreme ability groups and the capacity of the tailored
test to reproduce, using fewer test items, the rank ordering of examinees
supplied by the conventional group test. The results of these studies tended
to support the use of tailoring strategies, and the sorts of questions
addressed allowed the use of traditional item indices.

The use of traditional item indices no longer suffices at the individual
examinee level when the problem of interest is ability estimation. Here,
based upon the set of test items an examinee encounters, we want to make
an inference as to expected performance on a large set of questions like
those encountered (Lord, 1974b). This expected performance is the ability
of the examinee measured by the test items. Since we are concerned now
with ability estimation on a single individual, this precludes the use of
traditional item indices in selecting items, because these statistics are
based upon a particular norm group. A set of calibrated items that is
free from the norm or calibrating group is necessary.

Tailoring test items to an examinee dictates that different examinees
take different test items. What is needed are examinee ability estimates
that are independent of the particular choice of test items, if there is
interest in comparing one examinee with another. The solution to this
problem, and to the one mentioned previously, is provided by latent trait
theory. Classical methods are of no value here.

(b) Latent Trait Theory and Tailored Testing 

In order to perform the three tasks discussed by Lord, it is necessary
to introduce the notion of item characteristic curves. This will allow us
to predict how an examinee will perform on a new item, even if the item has
a different difficulty level from the one previously responded to. The two-
and three-parameter logistic curves have most often been selected as the
mathematical form of item characteristic curves used in tailored testing
research.
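
For reference, the two curves named above are usually written as shown below. The notation is the conventional one and is assumed here rather than quoted from this section: $b_g$ is the difficulty of item $g$, $a_g$ its discrimination, $c_g$ its lower asymptote, and $D = 1.7$ a scaling constant.

```latex
% Three-parameter logistic item characteristic curve for item g
P_g(\theta) = c_g + (1 - c_g)\,
  \frac{e^{D a_g (\theta - b_g)}}{1 + e^{D a_g (\theta - b_g)}}

% Two-parameter logistic curve: the special case c_g = 0
P_g(\theta) = \frac{e^{D a_g (\theta - b_g)}}{1 + e^{D a_g (\theta - b_g)}}
```

Once $a_g$, $b_g$, and $c_g$ are known for an item, the curve gives the predicted probability of success at any ability level, which is what the first of Lord's three tasks requires.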

(c) Tailored Testing Strategies

Research done on tailored testing, whether based upon latent trait
theory or classical theory, has been built upon the following rule: If an
examinee answers an item correctly, the next item should be more difficult;
if an examinee answers incorrectly, the next item should be easier. Based
upon this general rule, certain branching strategies have been devised.
These strategies can be broken down into two-stage strategies and
multi-stage strategies. The multi-stage strategies are either of the fixed
branching variety or the variable branching variety.



In the two-stage procedure, all examinees take a routing test and,
based upon scores on this test, are directed to one of a number of
measurement tests lying at various points along the ability continuum. Ability
estimates are then arrived at through a suitable combination of scores
from the routing test and the measurement test, which is usually peaked at
a particular difficulty level. Lord (1971a) uses a maximum likelihood procedure
that combines ability estimates for both the routing and measurement test
in a fashion such that each estimate is weighted inversely by its estimated
variance. Other combinations would also appear suitable.
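
The weighting rule just described can be written out in a few lines. This is a minimal sketch of inverse-variance weighting only, not a reproduction of Lord's full procedure; the function name is hypothetical, and the variance returned for the combined estimate assumes the two estimates are independent.

```python
def combine_estimates(est_routing, var_routing, est_measurement, var_measurement):
    """Combine two ability estimates, weighting each inversely by its variance."""
    w_routing = 1.0 / var_routing
    w_measurement = 1.0 / var_measurement
    total_weight = w_routing + w_measurement
    combined = (w_routing * est_routing + w_measurement * est_measurement) / total_weight
    # Variance of the combined estimate, assuming independent estimates.
    combined_variance = 1.0 / total_weight
    return combined, combined_variance
```

For example, a routing-test estimate of 0.2 with variance 0.5 and a measurement-test estimate of 0.8 with variance 0.25 combine to 0.6, closer to the more precise measurement-test estimate.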

Whereas the two-stage strategy requires only one branching solution
from the routing to the measurement test, multi-stage strategies involve a
branching decision after the examinee responds to each item. If the same
item structure is used for all individuals, but each individual can move
through the structure in a unique way, then it is called a fixed branching
model. Considering how much item difficulty should vary from item to item
leads to involvement with constant step size structures (usually represented
as pyramids) or decreasing step size pyramids. If guessing should become
a consideration, then a possible solution would be to make step size in the
positive direction less than that in the negative direction (Lord, 1970b).

For these multi-stage fixed branching models, all examinees start at
an item of median difficulty on the continuum (b = 0) and, based upon a
correct or an incorrect response, start to pass through a set of items that
have been arranged on the basis of item difficulty. After having completed
a fixed set of items, either of two scores is used to give an estimate of
ability. One score is the difficulty of the item that would have been
administered to the examinee after the nth (last) item. The other score
is the average of the item difficulties, excluding the first item that
everyone takes, but including the hypothetical (n+1)st item. Lord (1971a,
1971b, 1974b) has demonstrated that different scores should be used to
estimate ability depending upon the strategy used. For constant step size
procedures (up and down methods), the average difficulty score is preferred,
while for variable step-size procedures (Robbins-Monro methods), the final
difficulty score should be used.
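
The fixed branching rule and the two scores defined above can be sketched as follows. The function names are hypothetical, the response pattern is invented, and the decreasing step sizes 1/k are only one common choice for a Robbins-Monro type sequence, assumed here for illustration.

```python
def fixed_branching(responses, step_sizes, start=0.0):
    """Follow a fixed branching path through items ordered by difficulty.

    responses: 1 (correct) or 0 (incorrect) for each of the n administered items.
    step_sizes: the positive step applied after each response.
    Returns n + 1 difficulties: the n administered items followed by the
    hypothetical (n+1)st item that would have been given next.
    """
    difficulties = [start]
    for u, step in zip(responses, step_sizes):
        # Correct answer: move to a harder item; incorrect: to an easier one.
        difficulties.append(difficulties[-1] + (step if u == 1 else -step))
    return difficulties

def final_difficulty_score(difficulties):
    """Difficulty of the hypothetical (n+1)st item."""
    return difficulties[-1]

def average_difficulty_score(difficulties):
    """Mean difficulty, excluding the first item but including the (n+1)st."""
    rest = difficulties[1:]
    return sum(rest) / len(rest)

pattern = [1, 1, 0, 1]  # invented response pattern

# Constant step size (up-and-down method).
up_and_down = fixed_branching(pattern, [0.5] * len(pattern))

# Decreasing step size 1/k (assumed Robbins-Monro type sequence).
shrinking = fixed_branching(pattern, [1.0 / k for k in range(1, len(pattern) + 1)])
```

For the constant step path the difficulties are 0, 0.5, 1.0, 0.5, 1.0, so the average difficulty score is 0.75 and the final difficulty score is 1.0.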

The variable branching strategies are multi-stage strategies that do
not operate with a fixed item structure. Rather, at each stage of the
process, an item in the established item pool is selected for a certain examinee
in a fashion such that the item will maximally reduce the uncertainty of the
examinee's ability estimate, if administered. After administration of the
item, the ability estimate is either recomputed using Bayes Theorem (Owen,
1975), or recalculated using the maximum likelihood procedure. A normal
prior on ability is assumed for the Bayesian method, and the administration
of items is terminated when $\sigma_m^2$ is less than a designated value,
where $\sigma_m^2$ is the posterior variance of the ability estimate after
m items have been administered. For the maximum likelihood procedure, item
administration ceases at a set number of items or when the standard error of
the estimate is less than a prescribed value for the last item administered.
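
A variable branching procedure of the Bayesian kind can be sketched as below. This is a simplified stand-in and not Owen's (1975) procedure: the posterior is approximated on a grid of ability values rather than in closed form, and "maximally reduce the uncertainty" is approximated by picking the unused item with the greatest information at the current posterior mean. The item pool, the function names, and the default thresholds are all hypothetical.

```python
import math

D = 1.7  # logistic scaling constant (assumption)
GRID = [-4.0 + 0.05 * i for i in range(161)]  # ability values from -4 to 4

def p3pl(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def item_information(theta, a, b, c):
    """Item information for the three-parameter logistic model."""
    p = p3pl(theta, a, b, c)
    return (D * a) ** 2 * ((p - c) / (1.0 - c)) ** 2 * (1.0 - p) / p

def posterior_moments(weights):
    """Mean and variance of the ability distribution held on the grid."""
    total = sum(weights)
    mean = sum(w * t for w, t in zip(weights, GRID)) / total
    var = sum(w * (t - mean) ** 2 for w, t in zip(weights, GRID)) / total
    return mean, var

def tailored_test(pool, answer, target_var=0.1, max_items=30):
    """Administer items until the posterior variance drops below target_var.

    pool: list of (a, b, c) item parameter tuples.
    answer: callable taking an item tuple and returning 1 or 0.
    Returns (posterior mean, posterior variance, number of items used).
    """
    weights = [math.exp(-0.5 * t * t) for t in GRID]  # standard normal prior
    remaining = list(pool)
    mean, var = posterior_moments(weights)
    used = 0
    while remaining and used < max_items and var >= target_var:
        # Pick the unused item that is most informative at the current estimate.
        item = max(remaining, key=lambda it: item_information(mean, *it))
        remaining.remove(item)
        u = answer(item)
        # Bayes update: multiply the prior by the likelihood of the response.
        weights = [w * (p3pl(t, *item) if u else 1.0 - p3pl(t, *item))
                   for w, t in zip(weights, GRID)]
        norm = sum(weights)
        weights = [w / norm for w in weights]
        mean, var = posterior_moments(weights)
        used += 1
    return mean, var, used
```

The `answer` argument stands in for the examinee; in a simulation it could draw a response at random from the item characteristic curve at a chosen true ability. Replacing the Bayes update with a maximum likelihood re-estimate and the variance test with a standard error test would give the second variant described in the text.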






(d) Studies Using Latent Trait Theory

As discussed in the previous section, there are a number of ways in
which examinees can be presented with items tailored to their ability, and
there are also a number of ways of computing scores to estimate ability,
based upon item difficulties. What is also needed is a means of evaluating
results obtained from various procedures. The mechanism for evaluation should
not be based on group statistics such as correlation coefficients, because
the crux of the situation is to determine the accuracy with which we can
measure ability for a single examinee. Most of the studies on tailored
testing to date have made use of test information curves.

Lord (1971a) compared the test information curves obtained for various
two-step procedures with a test information curve provided by a conventional
peaked test (which he calls the standard test). His conventional test
provided maximum information for scores at the median ability level of the
continuum (b = 0), and decreasing information for scores deviant from median
ability. Specific results of the study and others that he did (for
example, Lord, 1970b, 1971b) cannot be summarized briefly because of the
multitude of test designs and strategies that he studied. What is clear
is this: The tailored procedures provide more information at the extremes of the
ability distribution than does the standard test, and provide adequate
information at the median difficulty and ability level (b = 0), where the
standard test cannot be surpassed.






Studies using the variable branching models will not be discussed in
this paper. This is because it is very difficult to compare the results
from these strategies among themselves, let alone with the fixed branching
models. Readers are referred to Owen (1975) and Wood (1973, 1976a).
Weiss (1974, 1976) and Vale and Weiss (1974), in their reviews of the
strategies and relevant studies, also summarize some of the results of these
procedures.

(e) Final Comments

The work by Lord and others in introducing latent trait models to
explain or predict examinee performance in individualized testing situations
represents one of the most successful of the applications of latent trait
theory to date. Of course, much work remains to be done. For example,

1. It is unclear as to which of the various scoring methods, be it
final difficulty score, average difficulty score, or any of the
other possibilities, gives the best statistical approximation
of ability. This is especially a problem when the number of test
items administered is small.

2. The present models do not deal well with the effects of guessing.
Since tailoring strategies minimize the number of items too
difficult for an examinee, guessing should be reduced and
any guessing that goes on probably can't be considered random.
What is needed is an investigation of the exact effects of
guessing on tailoring strategies for ability estimation.

The above list of problems and/or research areas is not meant to be
all-inclusive. Wood (1973), Green (1970), Lord (1977a), and Weiss (1974)
all offer further suggestions for research.

Item Banking



Interest in individualized instruction and testing has brought to
light the need for item banking (Choppin, 1976; Wood, 1976b). An item
bank is a collection of test items, "stored" with known item characteristics
and made available to test constructors. According to the intended purpose
of the test, items with the desired characteristics can be drawn from the
bank and used to construct a test with known properties.

Unfortunately, classical item statistics (item difficulty and
discrimination) are of limited value for describing the test items in the bank
because they are dependent on the group of examinees from which they came.
On the other hand, latent trait item parameters do not have this limitation
and therefore are more useful for describing test items in the bank.

A practical problem facing many test constructors is that of building,
over a period of several years, a pool of items to be used in constructing
test forms. Because of the time span, newly written items will need to
be pretested on groups of examinees different from groups used to pretest
other items in the pool. Because of the invariance property of the latent
trait item parameters, even though two pretest groups may be quite dissimilar
in ability, there are few problems in obtaining item parameters that are
comparable across these groups. Let us assume that we are interested in
describing items by the two item parameters of the logistic
test model. The one serious problem is that because the mean and standard
deviation of the ability scores are arbitrarily established, the ability
score metric is different for each group. Since the item parameters depend
on the ability scale, it is not possible to directly compare latent trait
item parameters derived from different groups of examinees until the ability
scales are equated in some way. Fortunately, the problem is not too hard
to resolve, since Lord and Novick (1968) have shown that the item parameters
in the two groups are linearly related. Thus, if a subset of calibrated
items is administered to both groups, the linear relationship between the
estimates of the item parameters can be obtained by forming two separate
bivariate plots, one establishing the relationship between the estimates
of the item discrimination parameters for the two groups, and the second,
the relationship between the estimates of the item difficulty parameters.
Having established the linear relationship between common item parameters
in the two groups, a prediction equation can then be used to predict item
parameters for the new items had they been administered to the first
group. In this way, all item parameters can be equated to a common group
of examinees and corresponding ability scale. No such linear relationship
exists between the classical model parameters.
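
The prediction step described above amounts to fitting a straight line to the common-item estimates from the two groups and applying it to the new items. The sketch below assumes an ordinary least squares fit, which is one convenient choice and is not specified in the text; the parameter values are invented and error free, so the fitted lines are recovered exactly.

```python
def fit_line(x, y):
    """Least squares slope and intercept for predicting y from x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict(value, line):
    """Apply a fitted (slope, intercept) pair to a new estimate."""
    slope, intercept = line
    return slope * value + intercept

# Invented estimates for four common items calibrated in both groups.
b_group_2 = [-1.0, 0.0, 1.0, 2.0]     # difficulties on the second group's scale
b_group_1 = [-0.75, 0.5, 1.75, 3.0]   # the same items on the first group's scale
a_group_2 = [0.8, 1.0, 1.2, 1.6]      # discriminations, second group
a_group_1 = [0.64, 0.8, 0.96, 1.28]   # discriminations, first group

difficulty_line = fit_line(b_group_2, b_group_1)
discrimination_line = fit_line(a_group_2, a_group_1)

# A new item pretested only in the second group, placed on the first group's scale.
new_b_on_group_1_scale = predict(0.4, difficulty_line)
new_a_on_group_1_scale = predict(1.5, discrimination_line)
```

Here the difficulty line has slope 1.25 and intercept 0.5, and the discrimination line has slope 0.8 and an intercept of essentially zero, so the new item's parameters on the first group's scale are 1.0 and 1.2. With real estimates the points scatter about the line, and aberrant common items would be worth inspecting before the line is used.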




Item Bias

The notion that certain items in a test may be biased toward certain
minority groups is becoming a matter of concern for the testing community.
The concern for test bias and therefore item bias essentially stems from
litigation involving the use of tests to classify minorities for
employment and educational opportunities. The problem here is properly called
test fairness, but in order for a test to be fair (in usage), it is
necessary but not sufficient that the items be unbiased. Test bias refers to
the psychometric properties of a set of test items or scores; test fairness
is concerned with the way the test is used in a particular situation. Thus
it would seem that a first step in investigating how tests are being or
not being used fairly with minorities is to investigate item bias.

Investigations of item bias using classical test theory have not been
successful. One reason for this has been offered by Pine (1976). Bias in
testing is caused by the inability of tests to consider individual
difference variables, such as motivation and ethnic background. Investigations
of these variables using classical test theory will further perpetuate
the problem; namely, that we are using a group-based approach, whether in
the test or in the bias study, to try to investigate individual difference
variables. We create a situation of bias and then try to use the mechanism
that created the situation in the first place to investigate it.
What is item bias, and why have traditional explanations for item
bias led to procedures of minimal use? The most extreme stance on item
bias is that a test is biased to the extent that the means of the two
populations considered are different. The problem here is that other
variables besides item bias contribute to these mean differences.
As Hunter (1975) says, it is not so much that the test is biased, but that
there is bias in the learning environments that help determine the test
score. The notion of matching will not help; it would be impossible to
list all relevant variables upon which to match. Noteworthy is that, from a
latent trait point of view, this lack of educational equality of experience
can be viewed as a problem of dimensionality. Experiences that one group
has had benefit of expand the dimensionality of the underlying structure
for that group in comparison to the other.

Taking the mean difference notion one step further doesn't help. If
we suppose that we have a perfect unidimensional test without bias, then
the difference between the means of groups should be consistent over items.
There would be no group by item interaction. If in an analysis of variance
a group by item interaction should prove to be significant, it has been
advanced that this fact is a demonstration that the items are biased.
However, Hunter (1975) has clearly pointed out that a perfectly unbiased
test can show such interaction. Items of varying difficulty demonstrate
an item by group interaction. Thus, it would seem that dealing with item
difficulties would be the next step, but there are problems with using the
classical definition of item difficulty as an indicant of item bias.

A classical definition of item difficulty would refer to the proportion
of correct answers given to an item. If the item difficulty were the same for
both groups, it has been advanced that this would be a demonstration that
the item was unbiased. Lord (1976) has noted that one could plot those
proportions for items on a test for the two groups and fit the resulting
scatterplot with a straight line. Departure from linearity would then
seem to be a good indicant of test, and one step further, individual item
bias. Lord clearly points out that the failure of points to fall on a
straight line does not mean that there is test and item bias. He states
the following reasons for his stand:

1. There is no good reason for the points to lie on a straight
line in the first place. If one group consistently outperforms
the other, the relationship must be curved. Further, while
using the inverse normal transformation (and perhaps further
transforming to d values) does straighten the line of
relationship, there are still further causes for problems.

2. If the questions can be answered by guessing, even using the
inverse normal transformation is not going to assure that the
points will lie on a straight line unless the groups performed
equally well on the test.

3. If guessing weren't a problem, the discrimination index of an
item would be. More discriminating items would produce more of
a difference between groups than less discriminating items. Items
of the same discrimination would lie along the same line, but
there is no assurance, without building equal discrimination into
the situation or model, that this is the case.
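
Two of the points above can be checked numerically with a small sketch. It assumes the two-parameter logistic model with no guessing, D = 1.7, two normally distributed groups whose ability means differ by one standard deviation, and identical item parameters in both groups (that is, no bias by construction). The function names and all numerical settings are hypothetical.

```python
import math
from statistics import NormalDist

D = 1.7  # logistic scaling constant (assumption)

def expected_p(a, b, group_mean, group_sd=1.0, points=801):
    """Expected proportion correct on a 2PL item in a normal ability group."""
    lo = group_mean - 6.0 * group_sd
    step = 12.0 * group_sd / (points - 1)
    num = den = 0.0
    for i in range(points):
        theta = lo + i * step
        weight = math.exp(-0.5 * ((theta - group_mean) / group_sd) ** 2)
        num += weight / (1.0 + math.exp(-D * a * (theta - b)))
        den += weight
    return num / den

def p_gap(a, b, mean_1=0.0, mean_2=-1.0):
    """Difference in proportion correct between the two groups."""
    return expected_p(a, b, mean_1) - expected_p(a, b, mean_2)

def z_gap(a, b, mean_1=0.0, mean_2=-1.0):
    """The same difference after the inverse normal transformation."""
    inv = NormalDist().inv_cdf
    return inv(expected_p(a, b, mean_1)) - inv(expected_p(a, b, mean_2))
```

Under these assumptions `p_gap` is much larger for an item of middle difficulty than for a very hard item, which is a group by item interaction in an unbiased test, and `z_gap` is larger for a highly discriminating item than for a weakly discriminating one at the same difficulty, so items of unequal discrimination need not fall on one line even after the transformation.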

Thus, while we would want nothing other than item bias to keep points from
lying along a straight line, using proportion correct will not assure that
the situation will be so. Lord (1976) then demonstrates in a quite clear
and simple fashion that the proportion of correct answers (classical item
difficulty) is not really a measure of item difficulty. Stated simply, we
would want the item difficulty to be independent of the people used to
determine the index; this is not possible using "proportion correct" as the index.

Wright, Mead, and Draba (1976) and Hunter (1975) offer further
discussions about the problems inherent in using group-based statistics as
indicants of item bias or test bias. Factor analytic approaches, whereby
factor structures for the groups are compared, or the use of item-test
point biserials, suffer from the same problem as proportion correct; the
indices are dependent upon the group from which the measures were
established.



If all of the traditional indices, which not only describe the test
item, but also the group tested, are of questionable use in dealing with
test bias, what can be done? A useful index would have to be free of the
group used for defining it. This "sample-invariant" property does exist
for latent trait model parameters.

Using latent trait theory rather than classical test theory, we can
formulate a definition of item bias in a different fashion. According to
Pine (1976):

A test item is unbiased if all individuals having 
the same underlying ability have an equal probability 
of getting the item correct, regardless of subgroup 
membership. 

This means that item characteristic curves, which provide the probabilities
of correct responses, must be identical across different sub-populations of
interest. Taking this one step further, if the item characteristic curves
are the same, then the item parameter(s) have the same values for the
subgroups, up to a determinable linear transformation. If the subgroups
upon which the parameters are calibrated differ in means and variances,
then a linear transformation will be necessary to equate scales. If the
transformation is not applied, the parameters will be linearly related
for the subgroups (assuming no bias).
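
The reason a linear transformation of the ability scale leaves an unbiased item's curve unchanged can be shown in one line. The constants $k$ and $m$ below are introduced only for this illustration, and the item parameter notation ($a_g$, $b_g$, $c_g$) is the conventional logistic notation, assumed here.

```latex
% Assumed rescaling of the ability metric between subgroups (k > 0, m arbitrary)
\theta^{*} = k\theta + m, \qquad b_g^{*} = k b_g + m, \qquad
a_g^{*} = a_g / k, \qquad c_g^{*} = c_g

% The argument of the logistic function, and hence the curve, is unchanged
a_g^{*}\,(\theta^{*} - b_g^{*}) = \frac{a_g}{k}\,(k\theta + m - k b_g - m)
  = a_g\,(\theta - b_g)
\quad\Longrightarrow\quad P_g^{*}(\theta^{*}) = P_g(\theta)
```

Under this rescaling the difficulties of unbiased items calibrated in two subgroups fall on a line with slope $k$ and intercept $m$, and the discriminations fall on a line through the origin with slope $1/k$. An item whose parameters depart from these lines is a candidate for bias.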

How does one proceed? At least three solutions are currently being
studied. Lord (1976) is developing a statistical test for deciding whether
the item characteristic curve for an item is the same for the subgroups
involved. Pine (1976) first discusses a test for unidimensionality, and
also describes a possible method for correcting for item bias by adjusting
item parameter estimates. Pine and Weiss (1976) take items of varying item
bias and look at how this affects three test fairness models: the Cleary
model, the Thorndike model, and the model based upon a validity correlation
with an external criterion. Wright, Mead, and Draba (1976) and Mead (1976),
utilizing the Rasch model, develop, through the use of residuals, an ANOVA
approach to detecting item bias.

A study described by Lord (1976) is now in progress at ETS.
Noteworthy is that he advances a two step approach to the detection of item bias:

1. Plot item difficulties for the subgroups on the same graph,
and fit the plotted points with a straight line. This will put
all items on the same reference scale, and aberrant items will
demonstrate significant departures from linearity.

2. Test the hypothesis that the aberrant item has the
same item characteristic curve for the subgroups
of interest.

Pine (1976) suggests a two step procedure like Lord's, but he adds
one additional step; namely, testing for unidimensionality of a set of test
items. If the trait dimensions are the same, any variability in parameter
values can be attributed to item bias. However, it was mentioned earlier
in the paper that factor analysis of tetrachoric correlation matrices has
problems associated with it. It remains to be seen how useful in practice
this step will be.

Wright, Mead, and Draba (1976) and Mead (1976) utilize the "simpler"
nature of the Rasch model to develop a very interesting approach to studying
test and item bias. They first form a residual, i.e., the difference
between observed outcome on an item and expected outcome based upon the
model, and then transform metrics from the proportion metric to the ability
metric. Using residuals on the ability metric, they are able to set up a
weighted least squares ANOVA for testing shifts in item difficulty across
subgroups, which, in the Rasch model, would be the sole indicant of item
bias. Mead (1976) also discusses a graphical method whereby residuals
are plotted against the ability scale. The residuals plotted against the
ability scale fall along a horizontal line through the origin. Any
disturbance, such as guessing, discrimination differences caused by practice
or speed, or, most important here, item bias, will appear as a departure
from the horizontal. The shapes of departures would then indicate the sort
of disturbance present.
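
The idea of a residual on the ability metric can be sketched as follows. This is a simplified illustration and not the weighted least squares procedure of Wright, Mead, and Draba: it assumes that examinees are grouped by ability level, that the move from the proportion metric to the ability metric is the log-odds transformation, and that the Rasch model is written without a scaling constant. All names and values are hypothetical.

```python
import math

def logistic_p(theta, b, slope=1.0):
    """Probability of success; slope = 1.0 gives the Rasch model."""
    return 1.0 / (1.0 + math.exp(-slope * (theta - b)))

def logit(p):
    """Log-odds, which moves a proportion onto the ability metric."""
    return math.log(p / (1.0 - p))

def ability_metric_residuals(thetas, observed_props, b):
    """Observed log-odds minus the Rasch expected log-odds (theta - b)."""
    return [logit(p) - (theta - b) for theta, p in zip(thetas, observed_props)]

ability_levels = [-2.0, -1.0, 0.0, 1.0, 2.0]
calibrated_b = 0.0

# Item behaves as calibrated: residuals lie on the horizontal line at zero.
fitting = ability_metric_residuals(
    ability_levels, [logistic_p(t, 0.0) for t in ability_levels], calibrated_b)

# Item is half a unit harder for this subgroup: residuals sit at a constant -0.5.
shifted = ability_metric_residuals(
    ability_levels, [logistic_p(t, 0.5) for t in ability_levels], calibrated_b)

# Item discriminates more sharply than the model allows: residuals slope upward.
steeper = ability_metric_residuals(
    ability_levels, [logistic_p(t, 0.0, slope=2.0) for t in ability_levels], calibrated_b)
```

The three patterns differ in shape: a level line at zero, a level line displaced from zero, and a sloping line. This is the sense in which the form of the residual plot points to the kind of disturbance. With real data the observed proportions carry sampling error, and proportions of exactly 0 or 1 would have to be handled before taking log-odds.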

In summary, the application of latent trait models for the detection
of item bias is just now beginning. As with the field of tailored testing,
classical test theory will not solve the problem of interest. Certain
transformations of the classical indices will help in curtailing some
problems, but one can never escape the dependency upon population
characteristics. As such, any indication of item bias can't be read in a pure
fashion; it could also be the result of another variable, such as guessing,
which classical indices cannot control for.

The areas for further expansion and research have been well defined
elsewhere. These include:

1. Development of a method for correcting item parameter values to
account for bias in the item. This would seem to be of value
in eliminating the effect of bias in the item rather than
eliminating the item itself. Pine is presently working on such techniques.

2. A further study of the effects of item bias on test fairness
models other than those investigated by Pine and Weiss (1976).

3. Further study and documentation of the ANOVA of residuals
method developed by Wright et al. (1976).




In conclusion, while the use of item characteristic curves for
detecting item bias is in the beginning stages, it appears that the
critical areas of concern are now being investigated. The months ahead
will bring evidence as to the feasibility and practicality of the use of
these methods.

Test Equating 

Large scale testing situations often dictate the need for multiple
and interchangeable forms of the same test. Test construction techniques
do not assure that two (or more) forms of a test can be made equivalent
in level and range of difficulty, and hence there is a necessity for
test-score equating. In equating the forms, the system of units of one form
is converted to the system of units of the other, so that scores derived
from the two forms, after conversion, will be equivalent (Angoff, 1971).

The advantages of equating test scores are that one can study and
measure growth using equated forms, can merge data when the data are derived
from different forms of a test, and, perhaps most importantly, can
compare the performance of two individuals who have taken different
test forms.

Two sorts of stipulations or restrictions involving the equating
process can be explicated:

1. The tests that are to be equated must be measures of the same
characteristics. Tests measuring different traits or abilities
cannot be equated.

2. If equating is to be a transformation of only systems of units,
the transformation must be unique (except for a random error
component). By this is meant that the transformation must not
be situation specific, but be independent of the individuals
from which the data were drawn to perform the conversion, and
be applicable to other situations.






The extant literature in this field can be roughly separated into
three areas: Angoff's explication of the field (prior to the use of latent
trait theory), the Rentz and Bashaw work (1975, 1977) on equating using
the Rasch model, and Lord's work (1975a, 1977b).

The methods described by Angoff are adequate for handling parallel
tests that are to be equated. Lord's work essentially deals with
non-parallel equating situations, and his 1975 study contrasts situations
where raw score methods using equipercentile equating can be used, to the
use of item characteristic curves for the same situations.

There are essentially three distinct ways of collecting data for an
equating project: (1) Administer the two tests to the same group of
individuals, (2) Administer the two tests to two equivalent groups of
individuals, where the groups are set up by random sampling, or (3) Administer the
two tests along with an anchor test to two groups that need not be
equivalent. The anchor test, which is administered as a part of both of the
non-parallel forms to be equated, measures differences from equivalence
in the two groups. The anchor test should demonstrate a high correlation
with the two tests to be used in an equating study.

Besides the three methods of data collection as mentioned above,
there are also two methods of non-linear equating. One method is the
equipercentile method using raw scores (Angoff, 1971). For
non-parallel tests, the true scores on the two tests will have
a non-linear relationship, and because of this, the standard error
of measurement for the equated test will probably not be equal to the
standard error of measurement of the test being equated to, for the entire
score scale. This is critical to equating, and if the standard errors are
not the same, raw scores cannot be equated with the assurance of strict
interchangeability.

The other method of equating is based upon ability estimates using
item characteristic curves. Lord (1975a) points out that if we are willing
to equate on ability estimates and if the theory holds, it models the
non-linear relationship exactly. This means only linear relationships would
need to be dealt with in equating.

Using data from the Anchor Test Study (Loret, Seder, Bianchini, &
Vale, 1974), based upon a single group of individuals who took both tests
to be equated, Lord demonstrated that equating using item characteristic
curves formulated an equating line that closely coincided with the line
developed by equipercentile methods. Using the LOGIST program (Wood et al.,
1976), item parameters and a single ability estimate for each individual
were obtained by combining forms. Then estimated true scores T were
found for each test form from the relation

    $T = \sum_{g} P_g(\theta)$

where $P_g(\theta)$ is the three-parameter logistic curve estimated by
LOGIST. These estimated true scores were then equated, and the method was
found to closely coincide with equipercentile methods using raw scores.
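
The true score relation above is easy to evaluate once item parameters are in hand. The sketch below assumes the three-parameter logistic curve with D = 1.7 and two invented three-item forms whose parameters are already on a common ability scale; it does not reproduce LOGIST or the data of the study.

```python
import math

D = 1.7  # logistic scaling constant (assumption)

def p3pl(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def true_score(theta, items):
    """Estimated true score: the sum of the item curves at ability theta."""
    return sum(p3pl(theta, a, b, c) for a, b, c in items)

def equating_pairs(thetas, form_x, form_y):
    """True scores on both forms at each ability; the pairs trace the equating line."""
    return [(true_score(t, form_x), true_score(t, form_y)) for t in thetas]

# Invented forms: form Y uses the same items made half a unit more difficult.
form_x = [(1.0, -1.0, 0.2), (1.2, 0.0, 0.2), (0.8, 1.0, 0.2)]
form_y = [(a, b + 0.5, c) for a, b, c in form_x]

pairs = equating_pairs([-2.0, -1.0, 0.0, 1.0, 2.0], form_x, form_y)
```

Each pair gives the form X true score and the form Y true score that correspond to the same ability, so a true score on one form can be converted to the other by matching on ability. Because form Y is the harder form here, its true score is the lower of the two at every ability level.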

In sum, Lord's studies involving a single group demonstrate that true
score equating and equating using the estimated distribution of observed
scores closely coincide with the conventional method of equipercentile
equating of raw scores. It remains to be seen which method is practically
most advantageous, but from a computer time point of view, the conventional
method would seem more practical if item parameters have to be estimated
for each item. If this had already been done, the decision would be less
clear.

Using another data set from the Anchor Test Study, where
representative and equivalent samples took one of the two tests, Lord was able to
equate the tests using a number of methods. These included:

1. Because there were no overlapping students (as in the single
group) or overlapping items (as in the Anchor Test), there was
no way to get a single ability estimate across both tests for
an individual. Therefore an ability estimate was obtained for
each examinee on the respective test he/she took, and the
ability estimates were equated using the equipercentile method.
An advantage of such a method is that when the two tests
measure the same ability, the ability estimates have a straight
line relationship under the latent trait model used (the raw
scores would not). This allows easier extrapolation at the
extremes of the distribution, where data are often scarce.

2. Using the straight line plotted to the ability estimates and
using an inverse transformation twice (see Lord, 1975a), the
curvilinear relationship between true scores may be obtained
and the scores equated.

Thus, for equivalent groups^ equating using ability level offers a 
distinct advantage in that the line for equating will be straight* Other 
ways of equating (using estimated true scores, estimated distributions of 
observed scores, or equipercentlle equating using raw scores) have curvi^ 
linear equating lines. It is as if by using ability estimates for equating, 
we are reducing the equating problem for non^llnear tests to one of llnaar 
(parallel) tests. 
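
The equipercentile step applied to ability estimates can be sketched as follows. This is an illustration under stated assumptions, not Lord's implementation: the two samples are made-up stand-ins for ability estimates from equivalent groups, and the helper names (quantile, percentile_rank, equate_abilities) are hypothetical. An estimate on test X is mapped to the estimate on test Y that has the same percentile rank.

```python
import bisect

def quantile(sorted_vals, p):
    """Linearly interpolated quantile of a sorted sample, for 0 <= p <= 1."""
    pos = p * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])

def percentile_rank(sorted_vals, x):
    """Inverse of quantile(): the p at which the sample quantile equals x."""
    if x <= sorted_vals[0]:
        return 0.0
    if x >= sorted_vals[-1]:
        return 1.0
    hi = bisect.bisect_right(sorted_vals, x)  # first index whose value exceeds x
    lo = hi - 1
    frac = (x - sorted_vals[lo]) / (sorted_vals[hi] - sorted_vals[lo])
    return (lo + frac) / (len(sorted_vals) - 1)

def equate_abilities(theta_x, sample_x, sample_y):
    """Map an ability estimate on test X to the test Y scale by equal percentile rank."""
    xs, ys = sorted(sample_x), sorted(sample_y)
    return quantile(ys, percentile_rank(xs, theta_x))

# Made-up ability estimates; the Y scale is a linear function of the X scale,
# which is the straight-line case described in the text.
sample_x = [i / 10 for i in range(-30, 31)]
sample_y = [0.8 * t + 0.5 for t in sample_x]

print(equate_abilities(1.05, sample_x, sample_y))
```

Because the relationship between the two ability scales is a straight line here, the equated points fall on that line, and a line fitted to the middle of the distribution could be extended to the extremes.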

The third and final method of data collection for equating tests
involves an Anchor Test. Because items overlap, one ability estimate
can be obtained for each examinee and then estimated true scores T are
computed and equated as in the second design. Also, an estimated frequency
distribution of raw scores can be obtained and equated as with the first
method. Both methods were compared to the equipercentile method using raw
scores, and the equating lines derived from item characteristic curves
agreed less closely with the raw score equating line than before. Lord
offers an explanation:

The conventional equipercentile equating of two tests to
an anchor test is an inefficient, and strictly speaking,
a biased and inadequate equating procedure for groups
that differ in ability level.

Thus, it would appear that in this situation equating using item
characteristic curves is a necessity. When the tests are not parallel,
and the groups are not equivalent, it would appear that item
characteristic curve methods are the only adequate way of ascertaining
equality of two tests.

In summary, when a single group takes both tests, it would appear that
the use of latent trait theory would be advantageous only if item
parameters have already been estimated. Results appear to coincide for raw
score equating and equating using item characteristic curves, and the
decision about method will probably be based upon computer use.

When equivalent groups take the two tests, latent trait theory equating
offers a distinct advantage if the equating is done using ability estimates,
for the equating line will be straight and extrapolation problems minimized.
Any other method of equating using item characteristic curves seems to
offer no advantage over conventional methods.

When an anchor test is used for non-equivalent groups, item characteristic
curve equating is the only justifiable method to use.


So far, we have said little about the use of the Rasch model in
equating tests (Brigman & Bashaw, 1976; Rentz & Bashaw, 1975). The
following points can be made:

1. The papers by Lord deal with the use of general item characteristic
   curves; that is, item parameters are not restricted. From
   this point of view, use of the Rasch model can be viewed as a
   special case of Lord's work.

2. The items must fit the assumptions of the Rasch model. If they
   do not, it would seem a necessity that a discussion of the uses
   of other latent trait models be presented.

3. The Rasch procedure is based upon obtaining equating constants
   for the two tests (see Rentz & Bashaw, 1975). Two methods exist
   for doing this, the item difficulty method and the ability method.
   In either case, it is necessary that the same group of individuals
   take both tests. Thus, the procedures can be viewed as a subset
   of our discussion of data collection method one above. While
   the simplicity of the Rasch equating procedure would seem to
   warrant its use, it can only be used for test items that fit the
   model and under situations where the same group takes both tests.

In Rentz and Bashaw (1975), the authors conclude that equating using
the Rasch model involved an equating line that closely coincided with the
conventional method. They also mentioned that the Rasch procedure
involved less time, effort, and money (discussed as savings). Two comments
seem appropriate: (1) The results confirm the results of Lord's study
using a single group, and (2) the mentioned savings may have been partially
an artifact of the complexity of the equating study. The Rasch study was a
reanalysis of the data from the Anchor Test Study, which is of a complex
nature, involving multiple equatings. It is not really known at present
whether equating using item characteristic curves on a single group, using
the Rasch model or otherwise, always affords a savings over conventional
methods. The mentioned savings may in fact be situation specific.






The present state of test equating would seem to be well explicated.
Unlike some of the other applications of latent trait theory, like tailored
testing, there are conventional methods, not using latent trait theory,
that work well in a variety of situations. Those areas where latent trait
models offer explicit advantages have been discussed. Lord (1975a, 1977b)
does, however, briefly indicate two areas that need further work:

1. More studies need to be done using item characteristic
   curves in equating, particularly studies comparing
   equating methods that use different item characteristic curve
   models to conventional methods.

2. If two tests are not parallel to begin with (i.e., have a non-linear
   equating curve), one is forced into the logic that the tests are not
   equally reliable for all subgroups of examinees. Thus, by
   definition, it is not proper to equate raw scores. Faced with
   a choice of exact true score equating or inexact raw score
   equating, one finds no criterion for choosing which to use.
   A set of criteria would need to be developed for this and other
   situations when a procedural choice must be made.




Estimation of Power Scores

A speeded test is defined as one for which examinees do not have
time to respond to some questions for which they know the answers. A
power test is one for which examinees have sufficient time to show
what they know. Most academic achievement tests are more speeded
for some examinees than for others.

Occasionally a test that is intended to be a power test becomes a
speeded test. An example of this situation is a test that has been
mistimed, i.e., examinees are given less than the specified amount of
time to complete the test. In this situation, it would be desirable to
estimate what an examinee's score would have been if the test had been
properly timed. This score is referred to as an examinee's power score.

Power scores are not difficult to obtain if the test items are
all of equal difficulty and equal discriminating power. An examinee's
expected item score on each unanswered item would equal the examinee's
proportion-correct score on the items that were attempted. However,
if items vary in difficulty or discrimination, another method is
needed. Lord (1973) has discussed a method using the three-parameter
logistic model and applied it to the estimation of power scores for
21 examinees who had taken a mistimed verbal aptitude test.

Lord's method requires not only the usual assumptions of the
three-parameter logistic model, but also assumes that the students
answer the items in order and that they respond as they would if
given unlimited time, i.e., if given more time, they would not go
back and change any of their answers.


If the test score, x, is the number of correct answers, the
expected power score for an examinee with ability level θ for a
set of n items is equal to the sum of the examinee's probabilities of
answering each item correctly. The probabilities are obtained from
the item characteristic curves. Therefore, if there is sufficient
data to estimate an examinee's ability score, and the item
characteristic curve parameters are known (or can be estimated), an
examinee's power score on the n test items can be estimated: It is
equal to the examinee's test score on the attempted items plus the
examinee's expected score on the unanswered items (found by summing
the examinee's probabilities of answering each unanswered item
correctly). Suppose k is used to designate the last item attempted by
an examinee, x is the examinee's score, n is the number of items in
the test, and $\hat{\theta}$ is the examinee's estimated ability derived
from the k items attempted by the examinee. The examinee's estimated power
score is given by

$$x + \sum_{g=k+1}^{n} P_g(\hat{\theta}).$$
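
As a minimal sketch of this computation in Python (an illustration, not Lord's program), the function below adds the examinee's score on the attempted items to the sum of the item characteristic curve values for the unanswered items. The item parameters, the ability estimate, and the scaling constant D = 1.7 are hypothetical; items are assumed to be listed in the order in which they were administered.

```python
import math

def p3pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic curve P_g(theta); D = 1.7 is an assumed scaling."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def power_score(x, k, theta_hat, items):
    """x + sum of P_g(theta_hat) for g = k+1, ..., n.

    x         -- number right on the k items attempted
    k         -- position of the last item attempted (1-based)
    theta_hat -- ability estimated from the k attempted items
    items     -- (a, b, c) tuples for all n items, in administration order
    """
    unanswered = items[k:]  # items k+1 through n
    return x + sum(p3pl(theta_hat, a, b, c) for a, b, c in unanswered)

# Hypothetical six-item test; the examinee reached item 4 and got 3 right.
items = [(1.0, -1.5, 0.2), (0.9, -0.5, 0.2), (1.1, 0.0, 0.2),
         (1.0, 0.5, 0.2), (1.2, 1.0, 0.2), (0.8, 1.5, 0.2)]
estimate = power_score(x=3, k=4, theta_hat=0.3, items=items)
print(round(estimate, 3))
```

When k equals n there are no unanswered items and the power score reduces to the observed score x.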

Lord (1973) reported the following application of his method.
Item parameters of the 90 verbal aptitude items comprising the
mistimed test were estimated using responses obtained from 944 students,
including the 21 mistimed students. Abilities were estimated for the 21
students from their responses to the items, excluding responses to
any unanswered items at the end of the test. Power scores were
estimated using the method described above.

Lord felt his method could be justified empirically if the
following properties of the estimates could be demonstrated:




1. Estimates of item parameters from one group of examinees
   closely approximate estimates of the same item parameters
   from other groups of examinees.

2. Estimates of ability parameters from part of a test closely
   approximate estimates obtained from the entire test.

3. The power score of an examinee on a test can be accurately
   approximated from his ability estimate as estimated from
   the same test.

In Lord's judgment, the available evidence has been quite
favorable:

1. Lord (1970a) showed good agreement between estimates of item
   characteristic curves from two different groups of examinees.

2. Correlations over .94 were obtained between ability estimates
   derived from different subsets of items in one study of SAT
   response data.

3. The correlation between power scores and number right scores
   has exceeded .98 in two different studies.

Lord cautioned that a wide variety of empirical checks would
have to be carried out before one could be sure of all the
circumstances under which the three properties of the estimates listed
above would hold.




Computer Programs

How can test practitioners use latent trait models in their work?
Fortunately, there are a number of computer programs available for
estimating ability and item parameters (Hambleton and Rovinelli, 1972;
Kolakowski and Bock, 1970; Wood and Lord, 1976; Wood, Wingersky, and
Lord, 1976; Wright and Mead, 1976a, 1976b; Wright and Panchapakesan,
1969). Some details on four of the computer programs, LOGIST, CALFIT,
BICAL, and DATAGEN, will be provided next.

LOGIST (Wood and Lord, 1976) allows the user to estimate
examinee abilities and all parameters of the three-parameter logistic
model. A maximum likelihood method is used to obtain estimates of
the item and ability parameters (Lord, 1974a). The item and ability
parameters are estimated simultaneously. For the estimates of the
parameters to converge, various restrictions are placed on the
parameters being estimated. Ability estimates are scaled to have a mean
of zero and a standard deviation of one.

The following statistics, reported by Wood et al. (1976), give
some idea of the computer time required for running on an IBM 360-65.
A test of 60 items and 5305 examinees took approximately 230 seconds
per complete stage. A complete stage involves the estimation of both
ability and item parameters. A test with 85 items and 2269 examinees
took approximately 130 seconds per complete stage. Convergence was
obtained after 10-15 stages. To achieve convergence, certain
restrictions are imposed. For example, (1) abilities for examinees with
zero scores, perfect scores, and those who answered less than 1/3 of
the items, are not estimated; and (2) an upper bound value is imposed
on the estimated discrimination parameters.
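
The ability half of such an estimation stage can be sketched in Python under strong simplifying assumptions. This is not the LOGIST algorithm: item parameters are treated as known (and are hypothetical), the scaling constant D = 1.7 is assumed, and a plain grid search stands in for the program's numerical procedure. The sketch also shows one reason for restriction (1): for a zero or perfect score the likelihood keeps increasing toward one end of the ability scale, so no finite maximum exists and the function returns None.

```python
import math

def p3pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic curve; D = 1.7 is an assumed scaling."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def log_likelihood(theta, responses, items):
    """Log-likelihood of a 0/1 response pattern at ability theta."""
    total = 0.0
    for u, (a, b, c) in zip(responses, items):
        p = p3pl(theta, a, b, c)
        total += math.log(p) if u == 1 else math.log(1.0 - p)
    return total

def ml_ability(responses, items, lo=-4.0, hi=4.0, steps=801):
    """Grid-search maximum likelihood ability estimate.

    Returns None for zero or perfect scores, which have no finite maximum.
    """
    if sum(responses) in (0, len(responses)):
        return None
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, responses, items))

# Hypothetical item parameters (a, b, c), treated here as known.
items = [(1.0, -1.5, 0.2), (0.9, -0.5, 0.2), (1.1, 0.0, 0.2),
         (1.0, 0.5, 0.2), (1.2, 1.0, 0.2)]

print(ml_ability([1, 1, 1, 0, 0], items))
print(ml_ability([1, 1, 1, 1, 1], items))  # None: perfect score
```

A full program would alternate a step of this kind with a step that re-estimates the item parameters, repeating until the estimates stop changing.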


Wood et al. (1976) provide a complete description of the
output from the program after each stage and after the job is completed.
The output after the final stage is completed includes: (1) final
item and ability estimates; (2) a summary containing various
statistics for each stage; and (3) the total time for the run.

According to a write-up (Wright and Mead, 1976a) on BICAL, "The
BICAL program estimates the parameters of the Rasch model when the
underlying response process is binomial . . . The algorithms used
for estimating item difficulties and person abilities are the
corrected unconditional maximum likelihood procedure and a normal
approximation . . . In addition to estimates of difficulty and ability,
and tests of item fit, output includes the standard errors associated
with these estimates, residual indices of item discrimination and
the degree of convergence of the estimation procedures." BICAL
contains a data simulator which can be used to verify the functioning
of the program or to provide an appropriate random background for
the Monte Carlo analysis of unusual data.

The CALFIT program has also been described by Wright and Mead
(1976b). This program performs 4 major tasks: (1) data input and
description; (2) data editing; (3) estimation of parameters; and
(4) analysis of fit.

The output includes: (1) the distribution of examinees by total
score; (2) the results of the estimation process; (3) the number of
iterations required for convergence; (4) the analysis of the fit of
the data to the Rasch model; (5) a summary of the fit information
in three sequences: serial order, difficulty order, and fit order;
(6) a plot of the e2 statistics, used in the fit analysis, against
the probability of a person in an ability group answering the item
correctly; (7) a plot of the item fit mean squares against item
difficulty; (8) a plot of the item fit mean squares against the index
of item discrimination; and (9) a plot of the item discrimination
index against item difficulty.

Hambleton and Rovinelli (1972) have produced a computer program
(DATAGEN) to simulate examinee item response data from logistic test
models. The purpose of the computer program is to allow users the
opportunity to study relationships among item and examinee ability
parameters, logistic test models, and test score characteristics. A
second purpose of the computer program is to produce test data with
known characteristics, so that robustness studies, studies of
estimation methods, studies of scoring methods, and so on, can be
conducted.

The program is designed to produce a set of response patterns
and test scores to represent the performance of N examinees on n
binary-scored items. By appropriate choice of item and ability
parameters in the program, it is possible to produce a set of response
patterns with a distribution of test scores approximating desired
mean, variance, kurtosis and skewness values. The item parameters in
the logistic test models used to generate the test data are described
by Lord and Novick (1968) and Hambleton and Traub (1971).

The user reads in specifications for the distribution of item
difficulty, discrimination, and guessing parameters and ability
parameters. Parameters may be selected from either a uniform
distribution with specified upper and lower bounds, or a normal
distribution with a specified mean and standard deviation. The user also
specifies the desired number of examinees and items, and starting numbers
for the random number generator.
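
The kind of data generation described here can be sketched in a few lines of Python. This is not the DATAGEN program (which is written in FORTRAN IV) and does not reproduce its options: the function names, parameter bounds, seed, and the scaling constant D = 1.7 are all assumptions made for illustration. Item parameters are drawn from uniform distributions with specified bounds, abilities from a normal distribution, and each binary response is generated by comparing a uniform random number with the three-parameter logistic probability.

```python
import math
import random

def p3pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic curve; D = 1.7 is an assumed scaling."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def draw_items(n_items, rng, a_bounds=(0.5, 1.5), b_bounds=(-2.0, 2.0),
               c_bounds=(0.0, 0.25)):
    """Draw (a, b, c) for each item from uniform distributions."""
    return [(rng.uniform(*a_bounds), rng.uniform(*b_bounds), rng.uniform(*c_bounds))
            for _ in range(n_items)]

def simulate(n_examinees, n_items, mean=0.0, sd=1.0, seed=12345):
    """Return (items, abilities, response patterns) for one simulated test."""
    rng = random.Random(seed)  # starting number for the random number generator
    items = draw_items(n_items, rng)
    abilities = [rng.gauss(mean, sd) for _ in range(n_examinees)]
    patterns = [[1 if rng.random() < p3pl(theta, a, b, c) else 0
                 for a, b, c in items]
                for theta in abilities]
    return items, abilities, patterns

items, abilities, patterns = simulate(n_examinees=200, n_items=10)
scores = [sum(p) for p in patterns]

# Conventional item difficulty (proportion correct) from the generated data.
p_values = [sum(p[g] for p in patterns) / len(patterns) for g in range(len(items))]
print(sum(scores) / len(scores), [round(p, 2) for p in p_values])
```

Because the generating parameters are known, estimates produced by any calibration procedure can be compared directly with the values that produced the data.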

Output from the program includes desired descriptive statistics
on the item parameters and estimated values on the basis of sample
data; a listing of the item parameters and estimated conventional
item parameters calculated from the generated test data. Also
reported is a complete set of summary statistics on the generated response
patterns and test scores. Response patterns may be either saved on
a data tape or punched out on computer cards.

The program is currently designed to generate response patterns
on up to 100 items, although the number of items can easily be
increased by changing a few dimension statements. The program is
practically machine independent except for the random number
generator.




Final Comments

The goal of this paper has been to review the developments in
latent trait theory to date, to demonstrate the applicability of
latent trait theory models to specific measurement problems, and
finally, to point out the advantages of the latent trait theoretical
approach over the classical approach for the solution of mental
measurement problems. However, the latent trait theoretical models
are, in general, mathematically more complex than the classical test
models, require strong assumptions that may limit their applicability
to mental data sets, and, in some cases, pose problems that are, as
yet, unresolved.

As pointed out in the paper, the latent trait models have
numerous advantages over the classical test models. Perhaps the most
important advantage of latent trait models is that it is possible
to estimate an examinee's ability on the same ability scale from any
subset of items that have been fitted to the model. This implies
that the ability of an examinee can be estimated independently of
the particular choice or the number of items and hence represents a
major breakthrough in the area of mental measurement. A consequence
of this fact is that examinees may be compared with each other even
though they may have taken quite different subsets of items. This
feature makes latent trait models indispensable to the field of
tailored testing, where examinees receive test items that are matched
to their ability level. In such situations the items administered
to different examinees will not be matched on difficulty, and hence
the usual test score metric will not permit meaningful comparisons
of examinees. Latent trait models take into account the difficulty
level of the items and reflect this in the estimates of the ability.
Thus, the estimates of the abilities of two examinees who receive
identical scores on easy and difficult subtests may differ, and hence
a meaningful comparison of the examinees is possible. A further
consequence of the fact that ability can be estimated
independently of the choice of items is that equating scores of tests that
measure the same ability is possible. In addition, the problem of
constructing parallel forms of tests is eliminated.

Another advantage of latent trait models is that the item
parameters are invariant across subgroups of examinees chosen from
a population of examinees. Item parameters, such as item difficulty
and discrimination, derived from classical test theory models are
not invariant across subgroups. They are defined for a particular
group of interest and will depend on the average ability of the group
being tested. Hence, despite their computational ease, classical
item parameters do not permit meaningful comparisons across
different populations of interest. Item parameters based on latent trait
models, on the other hand, permit comparisons across different
populations of interest and consequently are of immense value to test
developers. In particular, invariant item parameters are of
fundamental importance in the development of item banks and in detecting
item bias.

A further property inherent in latent trait models, not exhibited
by classical test models, is that it is possible to measure the
precision of the ability estimates at each ability level. Thus, instead
of providing a standard error of measurement that applies to all
examinees regardless of test scores, separate estimates of error for
each examinee or at each ability level are available through the
latent trait models.
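
One common way to express this ability-specific precision is through the test information function, whose inverse square root approximates the standard error of a maximum likelihood ability estimate. The Python sketch below uses the item information formula for the three-parameter logistic model; the item parameters and the scaling constant D = 1.7 are hypothetical, and the sketch is an illustration rather than a procedure taken from any of the programs described above.

```python
import math

def p3pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic curve; D = 1.7 is an assumed scaling."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def item_information(theta, a, b, c, D=1.7):
    """Item information for the three-parameter logistic model."""
    p = p3pl(theta, a, b, c, D)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def standard_error(theta, items):
    """Approximate standard error of the ability estimate at theta."""
    info = sum(item_information(theta, a, b, c) for a, b, c in items)
    return 1.0 / math.sqrt(info)

# Hypothetical (a, b, c) parameters for a short test centered near theta = 0.
items = [(1.0, -1.0, 0.2), (1.2, 0.0, 0.2), (0.8, 1.0, 0.2), (1.0, 0.5, 0.2)]

for theta in (-3.0, 0.0, 3.0):
    print(f"theta={theta:+.1f}  SE={standard_error(theta, items):.3f}")
```

For this hypothetical test the standard error is smallest near the middle of the ability scale, where the items are concentrated, and larger at the extremes, which is the pattern a single test-wide standard error of measurement cannot show.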

Despite these advantages, there are several unresolved issues
which need further investigation. Since latent trait models require
strong assumptions, the question that naturally arises is that of the
robustness of the latent trait models. Robustness refers to the
extent that data can deviate from underlying assumptions of a latent
trait model and still be fit by the model. The studies reported to
date have often produced different conclusions (see, for example,
Hambleton [1969] and Panchapakesan [1969]). Researchers have reached
different conclusions because they have used subjective methods to
interpret the results of robustness studies. It is obvious that the
assumptions of any latent trait model will never be completely
satisfied by any data set. Hence, the important questions are whether
latent trait analyses provide useful summaries of test data, lead to
better test score interpretations, and can predict appropriately
chosen criteria. When the last question was studied by Lord (1974a),
he obtained excellent predictions. However, the issue of robustness
is not yet completely resolved, and further work is clearly needed.

The major problem that remains to be solved is that of
estimation of parameters in latent trait models. As pointed out earlier,
the simultaneous estimation of item and ability parameters in latent
trait models leads to difficulties. In addition, the estimates of
the item parameters, especially those of the guessing parameters,
will not be stable if examinees with a wide range of abilities are
not used. Furthermore, current estimation procedures require a
large number of examinees and items before stable estimates can be
obtained, a problem similar to that of estimating parameters in
regression models. The numerical problems associated with the
estimation procedures present another area of concern.

Further research is clearly needed in the above areas. Although
it may not be possible to show that the maximum likelihood estimates
of item and ability parameters possess optimal properties, these
estimates may approximate the ideal estimates in some situations.
For instance, the comparison of the unconditional estimates and the
conditional estimates of the item parameters in the Rasch model
(Wright and Douglas, in press) has provided a meaningful insight into
the nature of the estimates. These comparisons can be carried out
for the two- and three-parameter logistic models. (In this
connection, it should be pointed out that unconditional estimates in the
sense of Bock [1972] have not been obtained for the three-parameter
logistic model.) Finally, the feasibility of Bayesian procedures
should be investigated more fully. Incorporation of prior
information in the estimation procedure may provide improved estimates of
the parameters and may also permit estimation of parameters with a
small sample size and a small number of items. However, poor
specification of priors may adversely affect the estimates, and hence a
careful study of appropriate priors would be necessary.

In conclusion, we note that latent trait theory offers the
promise of solving the problems that arise in mental measurement. The
advantages of the latent trait theoretic approach over the classical
test theoretic approach are obvious. It appears that the major
factors that have hindered widespread use of latent trait
methods are the lack of familiarity with these methods on the part
of practitioners and the lack of user oriented computer programs.
These problems have been overcome in recent years, and hence we can
expect latent trait theoretic procedures to emerge as methods of the
future for the measurement of mental abilities.






References 



Andersen, E.B. Asymptotic properties of conditional maximum
     likelihood estimates. The Journal of the Royal Statistical Society,
     Series B, 1970, 32, 283-301.

Andersen, E.B. The numerical solution of a set of conditional
     estimation equations. The Journal of the Royal Statistical Society,
     Series B, 1972, 34, 42-54.

Andersen, E.B. A goodness of fit test for the Rasch model.
     Psychometrika, 1973, 38, 123-140. (a)

Andersen, E.B. Conditional inference in multiple choice
     questionnaires. British Journal of Mathematical and Statistical
     Psychology, [...]

Anderson, J., Kearney, G.E., and Everett, A.V. An evaluation of Rasch's
     structural model for test items. British Journal of Mathematical
     and Statistical Psychology, 1968, 21, 231-238.

Angoff, W.H. Scales, norms, and equivalent scores. In R.L.
     Thorndike (Ed.), Educational Measurement. Washington: American
     Council on Education, 1971.

Birnbaum, A. Some latent trait models and their use in inferring an
     examinee's ability. In F.M. Lord and M.R. Novick, Statistical
     Theories of Mental Test Scores. Reading, MA: Addison-Wesley,
     1968.

Birnbaum, A. Statistical theory for logistic mental test models with
     a prior distribution of ability. Journal of Mathematical
     Psychology, 1969, 6, 258-276.

Bock, R.D. Estimating item parameters and latent ability when
     responses are scored in two or more nominal categories.
     Psychometrika, 1972, 37, 29-51.

Bock, R.D., and Lieberman, M. Fitting a response model for n
     dichotomously scored items. Psychometrika, 1970, 35, 179-197.

Bock, R.D., and Wood, R. Test theory. Annual Review of Psychology,
     1971, 22, 193-224.

Bradley, J.B. Distribution-free Statistical Tests. Englewood Cliffs,
     NJ: Prentice-Hall, 1968.

Brigman, S.L., and Bashaw, W.L. [...] test equating using the
     Rasch model. A paper presented at the annual meeting of AERA,
     San Francisco, 1976.






Choppin, B.H. Recent developments in item banking: A review. In
     D. DeGruijter and L.J. Th. van der Kamp (Eds.), Advances in
     Psychological and Educational Measurement. New York: Wiley, 1976.

Cleary, T.A., Linn, R.L., and Rock, D.A. An exploratory study of
     programmed tests. Educational and Psychological Measurement,
     1968, 28, 345-360.

Coffman, W.E. A factor analysis of the verbal sections of the
     Scholastic Aptitude Test. Research Bulletin 66-30. Princeton, NJ:
     Educational Testing Service, 1966.

Fischer, G.H. Some probabilistic models for measuring change. In
     D. DeGruijter and L.J. Th. van der Kamp (Eds.), Advances in
     Psychological and Educational Measurement. New York: Wiley, 1976.

Green, B.F. Comments on tailored testing. In W.H. Holtzman (Ed.),
     Computer-assisted Instruction, Testing, and Guidance. New
     York: Harper and Row, 1970.

Haley, D.C. Estimation of the dosage mortality relationship when
     the dose is subject to error. Technical Report No. 15.
     Stanford, CA: Applied Mathematics and Statistics Laboratory,
     Stanford University, 1952.

Hambleton, R.K. An empirical investigation of the Rasch test theory
     model. Unpublished doctoral dissertation, University of Toronto,
     1969.

Hambleton, R.K. Contributions to criterion-referenced test theory:
     On the uses of item characteristic curves and related concepts.
     Laboratory of Psychometric and Evaluative Research Report
     No. 51. Amherst, MA: School of Education, University of
     Massachusetts, 1977.

Hambleton, R.K., and Cook, L. Latent trait models and their use in
     the analysis of educational test data. Journal of Educational
     Measurement, 1977, 14, in press.

Hambleton, R.K., and Novick, M.R. Toward an integration of theory
     and method for criterion-referenced tests. Journal of
     Educational Measurement, 1973, 10, 159-170.

Hambleton, R.K., and Rovinelli, R. A FORTRAN IV program for
     generating examinee response data from logistic test models.
     Behavioral Science, 1973, 18, 74.

Hambleton, R.K., and Traub, R.E. Information curves and efficiency
     of three logistic test models. British Journal of Mathematical
     and Statistical Psychology, 1971, 24, 273-281.

Hambleton, R.K., and Traub, R.E. Analysis of empirical data using
     two logistic latent trait models. British Journal of
     Mathematical and Statistical Psychology, 1973, 26, 195-211.






Hambleton, R.K., and Traub, R.E. The robustness of the Rasch test
     model. Laboratory of Psychometric and Evaluative Research
     Report No. 42. Amherst, MA: School of Education, University
     of Massachusetts, 1976.

Hunter, J.E. A critical analysis of the use of item means and
     item-test correlations to determine the presence or absence of
     content bias in achievement test items. Paper presented at the
     National Institute of Education Conference on Test Bias,
     Annapolis, Maryland, 1975.

Jensema, C.J. An application of latent trait mental test theory.
     British Journal of Mathematical and Statistical Psychology,
     1974, [...], 29-48. (a)

Jensema, C.J. The validity of Bayesian tailored testing.
     Educational and Psychological Measurement, 1974, 34, 757-766. (b)

Jensema, C.J. A simple technique for estimating latent trait mental
     test parameters. Educational and Psychological Measurement,
     1976, 36, 705-715.

Keats, J.A. Test theory. Annual Review of Psychology, 1967, 16,
     217-238.

Kendall, M.G., and Stuart, A. Advanced Theory of Statistics. Vol.
     II. New York: Hafner Publishing Co., 1973.

Kiefer, J., and Wolfowitz, J. Consistency of the maximum likelihood
     estimates in the presence of infinitely many incidental
     parameters. Annals of Mathematical Statistics, [...], 27, 887-890.

Kolakowski, D., and Bock, R.D. A FORTRAN IV program for maximum
     likelihood item analysis and test scoring: Normal ogive model.
     Educational Statistics Laboratory Research Memo No. 12. Chicago:
     University of Chicago, 1970.

Lawley, D.N. On problems connected with item selection and test
     [...]. Proceedings of the Royal Society of Edinburgh,
     1943, 61, 273-287.

Lawley, D.N. The factorial analysis of multiple item tests.
     Proceedings of the Royal Society of Edinburgh, 1944, 62-A, 74-82.

[...] In S.A. Stouffer et al., Measurement and
     Prediction. Princeton: Princeton University Press, 1950.



[...] Bayesian estimates for the linear
     model. Journal of the Royal Statistical Society, [...]

[...] several programmed testing methods. Educational and
     Psychological Measurement, [...], 32, 85-95.



Lord, F.M. A theory of test scorei. Psychometri e Monograph. 1952 

■...No. .7. . . , ,", : . . / . "... ' 

Lord, F.m; M application of confidence intervals and of maximum 

likelihood to the estimation of an axaminee-s ability. Psyche 
V inatrika , 1953, 18, 57-75. (a) ~" 

Lord, P.M. The relation of test score to the trait underlying the 
test* Educational and Psychological Measurement , 1953^ 13; 
517-548. (b) 



Lordj FiM. An analysis of the Verbal Scholastic Aptitude Test using 
Birnbaum's three-parameter logistic model. Educational and Psy- 
chologlcal Measurement , 1968, ^i* 989^1020. ~~ ~ 

Lord, F.Ms Estimating Item characteristic curves without knowledge 

of their mathematical form, Psychometrlka ,1970, 35, 42-50. (a) 

. " ■ . ■ . - . . ■ ' ,' ' , ■ ; . ■ ■ . ■ . . '■ ^ i f ■■ . : 

Lord, F.M. Some test theory for tailored testing. In W.H. Holtzman
(Ed.), Computer-Assisted Instruction, Testing, and Guidance.
New York: Harper and Row, 1970. (b)

Lord, F.M. A theoretical study of two-stage testing. Psychometrika,
1971, 36, 227-242. (a)

Lord, F.M. Robbins-Monro procedures for tailored testing.
Educational and Psychological Measurement, 1971, 31, 3-31. (b)

Lord, F.M. The self-scoring flexilevel test. Journal of Educational
Measurement, 1971, 8, 147-151. (c)

Lord, F.M. A theoretical study of the measurement effectiveness
of flexilevel tests. Educational and Psychological Measurement,
1971, 31, 805-813. (d)

Lord, F.M. Power scores estimated by item characteristic curves.
Educational and Psychological Measurement, 1973, 33, 219-224.

Lord, F.M. Estimation of latent ability and item parameters when
there are omitted responses. Psychometrika, 1974, 39, 247-264.
(a)

Lord, F.M. Individualized testing and item characteristic curve
theory. In D.H. Krantz, R.C. Atkinson, R.D. Luce, & P. Suppes
(Eds.), Contemporary Developments in Mathematical Psychology,
Vol. II. San Francisco: Freeman, 1974. (b)

Lord, F.M. Quick estimates of the relative efficiency of two tests
as a function of ability level. Journal of Educational
Measurement, 1974, 11, 247-254. (c)

Lord, F.M. The relative efficiency of two tests as a function of
ability level. Psychometrika, 1974, 39, 351-358. (d)


Lord, F.M. A survey of equating methods based on item characteristic
curve theory. Research Bulletin 75-13. Princeton, NJ:
Educational Testing Service, 1975. (a)

Lord, F.M. Evaluation with artificial data of a procedure for
estimating ability and item characteristic curve parameters.
Research Bulletin 75-33. Princeton, NJ: Educational Testing
Service, 1975. (b)

Lord, F.M. Relative efficiency of number-right and formula scores.
British Journal of Mathematical and Statistical Psychology,
1975, 28, 46-50. (c)

Lord, F.M. The 'ability' scale in item characteristic curve theory.
Psychometrika, 1975, 40, 205-217. (d)

Lord, F.M. A study of item bias using item characteristic curve
theory. Paper presented at the Third International Association
for Cross-Cultural Psychology Congress, Tilburg University,
Tilburg, the Netherlands, 1976.

Lord, F.M. A broad-range tailored test of verbal ability. Applied
Psychological Measurement, 1977, 1, 95-100. (a)

Lord, F.M. Practical applications of item characteristic curve
theory. Journal of Educational Measurement, 1977, 14, in press.
(b)



Lord, F.M., and Novick, M.R. Statistical Theories of Mental Test
Scores. Reading, MA: Addison-Wesley, 1968.

Loret, P.G., Seder, A., Bianchini, J.C., & Vale, C.A. Anchor test
study: Equivalence and norms tables for selected reading
achievement tests (Grades 4, 5, 6). Washington: US Department
of Health, Education, and Welfare, US Office of Education,
1974.

Lumsden, J. The construction of unidimensional tests. Psychological
Bulletin, 1961, 58, 122-131.

Lumsden, J. Test theory. Annual Review of Psychology, 1976, 27,
251-280.

Marco, G. The application of item characteristic curve methodology
to practical testing problems. Journal of Educational
Measurement, 1977, in press.

McDonald, R.P., and Ahlawat, K.S. Difficulty factors in binary data.
British Journal of Mathematical and Statistical Psychology,
1974, 27, 82-99.

Mead, R. Assessing the fit of data to the Rasch model. Paper
presented at the annual meeting of the American Educational
Research Association, San Francisco, 1976.







Meredith, W., and Kearns, J. Empirical Bayes point estimates of
latent trait scores without knowledge of the trait distribution.
Psychometrika, 1973, 38, 533-554.

Millman, J. Criterion-referenced measurement. In W.J. Popham (Ed.),
Evaluation in Education: Current Practices. Berkeley, CA:
McCutchan Publishers, 1974.

Mulaik, S.A. The Foundations of Factor Analysis. New York:
McGraw-Hill, 1972.

Neyman, J., and Scott, E.L. Consistent estimates based on partially
consistent observations. Econometrica, 1948, 16, 1-5.

Novick, M.R., and Jackson, P. Statistical Methods for Educational
and Psychological Research. New York: McGraw-Hill Book Co.,
1974.

Owen, R. A Bayesian sequential procedure for quantal response in
the context of adaptive mental testing. Journal of the American
Statistical Association, 1975, 70, 351-356.

Panchapakesan, N. The simple logistic model and mental measurement.
Unpublished doctoral dissertation, University of Chicago, 1969.

Pine, S.M. Applications of item response theory to the problem of
test bias. Unpublished manuscript. Minneapolis: Department [...]

Pine, S.M., and Weiss, D.J. Effects of item characteristics on test
fairness. Research Report 76-5. Minneapolis: Department of
Psychology, University of Minnesota, 1976.

Rao, C.R. Linear Statistical Inference and Its Application. New
York: Wiley, 1965.

Rasch, G. An item analysis which takes individual differences into
account. British Journal of Mathematical and Statistical
Psychology, 1966, 19, 49-57.

Rentz, R.R., and Bashaw, W.L. Equating reading tests with the Rasch
model. Volume I final report, Volume II technical reference
tables. Athens, GA: University of Georgia, Educational
Research Laboratory, 1975.

Rentz, R.R., and Bashaw, W.L. The national reference scale for
reading: An application of the Rasch model. Journal of
Educational Measurement, 1977, 14, in press.

Ross, J. An empirical study of a logistic mental test model.
Psychometrika, 1966, 31, 325-340.

Ross, J., and Lumsden, J. Attribute and reliability. British
Journal of Mathematical and Statistical Psychology, 1968, 21,
251-263.




Samejima, F. Estimation of latent ability using a response pattern
of graded scores. Psychometric Monograph, 1969, No. 17.

Samejima, F. A general model for free-response data. Psychometric
Monograph, 1972, No. 18.

Samejima, F. A comment on Birnbaum's three-parameter logistic model
in the latent trait theory. Psychometrika, 1973, 38, 221-233.
(a)

Samejima, F. Homogeneous case of the continuous response model.
Psychometrika, 1973, 38, 203-219. (b)

Samejima, F. Normal ogive model on the continuous response level in
the multidimensional latent space. Psychometrika, 1974, 39,
111-121.

Thissen, D.M. Information in wrong responses to Raven's Progressive
Matrices. Journal of Educational Measurement, 1976, 13, 201-[...]



Tinsley, H.E.A., and Dawis, R.V. An investigation of the Rasch
simple logistic model: Sample free item and test calibration.
Educational and Psychological Measurement, 1975, 35, 325-339.

Torgerson, W.S. Theory and Methods of Scaling. New York: Wiley,
1958.

Urry, V. Approximations to item parameters of mental test models
and their uses. Educational and Psychological Measurement,
1974, 34, 253-269.

Vale, C.D., and Weiss, D.J. A study of computer-administered
stradaptive ability testing. Research Report 74-4. Minneapolis, MN:
Psychometric Methods Program, Department of Psychology,
University of Minnesota, 1974.

Wang, M., and Stanley, J. Differential weighting: A review of
methods and empirical studies. Review of Educational Research,
1970, 40, 663-705.

Weiss, D.J. Strategies of adaptive ability measurement. Research
Report 74-5. Minneapolis, MN: Psychometric Methods Program,
Department of Psychology, University of Minnesota, 1974.

Weiss, D.J. Adaptive testing research at Minnesota: Overview, recent
results, and future directions. In C.L. Clark (Ed.), Proceedings
of the First Conference on Computerized Adaptive Testing.
Washington, DC: United States Civil Service Commission, 1976.

Whitely, S., and Dawis, R.V. The nature of objectivity with the
Rasch model. Journal of Educational Measurement, 1974, 11,
163-178.






Wood, R. Response-contingent testing. Review of Educational Research,
1973, 43, 529-544.

Wood, R. Adaptive testing: A Bayesian procedure for the efficient
measurement of ability. Programmed Learning and Educational
Technology, 1976, 13, 34-48. (a)

Wood, R. Trait measurement and item banks. In D. DeGruijter and
L.J. Th. van der Kamp (Eds.), Advances in Psychological and
Educational Measurement. New York: Wiley, 1976. (b)

Wood, R.L., and Lord, F.M. A user's guide to LOGIST. Research
Memorandum 76-4. Princeton, NJ: Educational Testing Service,
1976.

Wood, R.L., Wingersky, M.S., & Lord, F.M. LOGIST: A computer
program for estimating examinee ability and item characteristic
curve parameters. Research Memorandum 76-6. Princeton, NJ:
Educational Testing Service, 1976.

Wright, B.D. Sample-free test calibration and person measurement.
Proceedings of the 1967 Invitational Conference on Testing
Problems. Princeton, NJ: Educational Testing Service, 1968.

Wright, B.D. Solving measurement problems with the Rasch model.
Journal of Educational Measurement, 1977, 14, in press. (a)

Wright, B.D. Misunderstanding of the Rasch model. Journal of
Educational Measurement, 1977, in press. (b)

Wright, B.D., and Douglas, G.A. Best procedures for sample-free
item analysis. Applied Psychological Measurement, 1977, 1,
in press.

Wright, B.D., and Douglas, G.A. Conditional versus unconditional
procedures for sample-free analysis. Educational and
Psychological Measurement, in press.

Wright, B.D., and Mead, R.J. BICAL: Calibrating rating scales with
the Rasch model. Research Memorandum No. 23. Chicago:
Statistical Laboratory, Department of Education, University of
Chicago, 1976. (a)



Wright, B.D., and Mead, R.J. CALFIT: Sample-free item calibration
with a Rasch measurement model. Research Memorandum No. 18.
Chicago: Statistical Laboratory, Department of Education,
University of Chicago, 1976. (b)

Wright, B.D., Mead, R., & Draba, R. Detecting and correcting item
bias with a logistic response model. Research Memorandum No. 22.
Chicago: Statistical Laboratory, Department of Education,
University of Chicago, 1976.



Wright, B.D., and Panchapakesan, N. A procedure for sample-free
item analysis. Educational and Psychological Measurement, 1969



